Why is it recommended to store BLOBs in separate SQL Server tables?

sql-server add tag

heinzi (imported from SE)

[This highly-upvoted SO answer](https://stackoverflow.com/a/5613926/87698) recommends to put images in separate tables, even if there is only a 1:1 relationship with another table:

> If you decide to put your pictures into a SQL Server table, I would strongly recommend using a separate table for storing those pictures - do not store the employee photo in the employee table - keep them in a separate table. That way, the Employee table can stay lean and mean and very efficient, assuming you don't always need to select the employee photo, too, as part of your queries.

Why? I was under the impression that [SQL Server only stores a pointer to some dedicated BLOB data structure](http://sqlmag.com/t-sql/varbinarymax-tames-blob) in the table, so why bother to manually create another layer of indirection? Does it really improve performance significantly? If yes, why?

Top Answer

Solomon Rutzky (imported from SE)

How large are these images, and how many do you expect to have? While I mostly agree with [@sp_BlitzErik][1], I think there are some scenarios where it is ok to do this, and so it would help to have a clearer picture of what is actually being requested here.

Some options to consider that alleviate most of the negative aspects pointed out by Erik are:

* [FILESTREAM][2] (starting in SQL Server 2008)
* [FileTables][3] (starting in SQL Server 2012)

Both of these options are designed to be a middle-ground between storing BLOBs either fully in SQL Server or fully outside (except for a string colun to retain the path). They allow for BLOBs to be a part of the data model and participate in Transactions while not wasting space in the buffer pool (i.e. memory). The BLOB data is still included in backups, which does make them take up more space and take longer to backup _and_ to restore. However, I have a hard time seeing this as a true negative given that if it is part of the app then it needs to be backed up somehow, and having only a string column containing the path is completely disconnected and allows for BLOBs files to get deleted with no indication of that in the DB (i.e. invalid pointers / missing files). It also allows for files to be "deleted" within the DB but still exist on the file system which will need to eventually be cleaned up (i.e. headache). But, if the files are HUGE, then maybe it is best to leave entirely outside of SQL Server except for the path column.

That helps with the "inside or outside" question, but does not touch on the single table vs multiple table question. I can say that, beyond this specific question, there are certainly valid cases for splitting tables into groups of columns based on usage patterns. Often when one has 50 or more columns there are some that are accessed frequently and some that are not. Some columns are written to frequently while some are mostly read. Separating frequently access vs infrequently accessed columns into multiple tables having a 1:1 relationship is quite often beneficial because why waste the space in the Buffer Pool for data you probably aren't using (similar to why storing large images in regular `VARBINARY(MAX)` columns is a problem)? You also increase the performance of the frequently access columns by reducing the row size and hence fitting more rows onto a data page, making reads (both physical and logical) more efficient. Of course, you also introduce some inefficiency by needing to duplicate the PK, and now sometimes you need to join the two tables, which also complicates (even if only slightly) some queries.

So, there are several approaches you could take, and what is best depends on your environment and what you are trying to accomplish.

---

> I was under the impression that SQL Server only stores a pointer to some dedicated BLOB data structure in the table

Not so simple. You can find some good info here, [What is the Size of the LOB Pointer for (MAX) Types Like Varchar, Varbinary, Etc?][4], but the basics are:

* `TEXT`, `NTEXT`, and `IMAGE` datatypes (by default): 16 byte pointer
* `VARCHAR(MAX)`, `NVARCHAR(MAX)`, `VARBINARY(MAX)` (by default):
* If the data can fit in the row, then it will be placed there
* If the data is less than approx. 40,000 bytes (the linked blog post shows 40,000 as the upper limit but my testing showed a slightly higher value) _AND_ if there is room on the row for this structure, then there will be between 1 and 5 direct links to LOB pages, starting at 24 bytes for the first link to the first 8000 bytes, and going up by 12 bytes per each additional link for each additional set of 8000 bytes, up to 72 bytes max.
* If the data is over approx. 40,000 bytes _OR_ there is not enough room to store the appropriate number of direct links (e.g. only 40 bytes left on the row and a 20,000 byte value needs 3 links which is 24 bytes for the first plus 12 for the two additional links for 48 bytes total required in-row space), then there will just be a 24 byte pointer to a text tree page which contains the links to the LOB pages).

[1]: https://dba.stackexchange.com/a/174680/30859
[2]: https://docs.microsoft.com/en-us/sql/relational-databases/blob/filestream-sql-server
[3]: https://docs.microsoft.com/en-us/sql/relational-databases/blob/filetables-sql-server
[4]: http://improve.dk/what-is-the-size-of-the-lob-pointer-for-max-types-like-varchar-varbinary-etc/

Answer #2

meme (imported from SE)

While I disagree that BLOBs should just be in another table -- they should [not be in the database at all][1]. Store a pointer to where the file lives on disk, and then just get that from the database...


The primary issue they cause (for me) is with indexing. Using XML with query plans, because everyone's got'em, let's make a table:


    SELECT TOP 1000
    ID = IDENTITY(INT,1,1),
    deq.query_plan
    INTO dbo.index_test
    FROM sys.dm_exec_cached_plans AS dec
    CROSS APPLY sys.dm_exec_query_plan(dec.plan_handle) AS deq

    ALTER TABLE dbo.index_test ADD CONSTRAINT pk_id PRIMARY KEY CLUSTERED (ID)


It's only 1000 rows, but [checking on the size][2]...


    sp_BlitzIndex @DatabaseName = 'StackOverflow', @SchemaName = 'dbo', @TableName = 'index_test'


It's over 40 MB for just 1000 rows. Assuming you add 40 MB every 1000 rows, that can get pretty ugly pretty quickly. What happens when you hit 1 million rows? That's just about 1 TB of data, there.


[![NUTS][3]][3]


Any queries that need to use your clustered index now need to read all of that BLOB data into memory *clarification:* when the BLOB data column is referenced.


Can you think of better ways to use SQL Server memory than storing BLOBs? Because I sure can.


Expanding it to nonclustered indexes:


    CREATE INDEX ix_noblob ON dbo.index_test (ID)
    
    CREATE INDEX ix_returnoftheblob ON dbo.index_test (ID) INCLUDE (query_plan)


You can design your nonclustered indexes to largely avoid the BLOB column so regular queries can avoid the clustered index, but as soon as you need that BLOB column, you need the clustered index. 


If you add it as an `INCLUDED` column to a nonclustered index to avoid a key lookup scenario, you end up with gigantic nonclustered indexes:[![enter image description here][4]][4]

More problems they cause:

- If anyone runs a `SELECT *` query, they get all that BLOB data.
- They take up space in backups and restores, slowing them down
- They slow down `DBCC CHECKDB`, because I know you're checking for corruption, right?
- And if you do any index maintenance, they slow that down as well.

Hope this helps!

  [1]: https://www.brentozar.com/archive/2015/03/no-more-blobs/
  [2]: http://firstresponderkit.org
  [3]: https://i.stack.imgur.com/Ug9ST.jpg
  [4]: https://i.stack.imgur.com/N1CMi.jpg

Answer #3

Joe Obbish (imported from SE)

If the data must be stored in SQL Server for whatever reason I can think of a few benefits to storing it in a separate table. Some are more convincing than others.

 1. Putting the data in a separate table means you can store it in a separate database. This can have advantages for scheduled maintenance. For example, you can run `DBCC CHECKDB` only on the database that contains the BLOB data.

 2. If you don't always put more than 8000 bytes into the BLOB then it's possible for it to [be stored in-row][1] for some rows. You may not want that because it will slow down queries that access data using the clustered index even if the column isn't needed by the query. Putting the data in a separate table removes this risk.

 3. When stored off row SQL Server uses an up to 24 byte pointer to point to the new page. That takes up space and limits the total number of BLOB columns you can add to a single table. See srutzky's answer for more details.

 4. A clustered columnstore index cannot be defined on a table containing a BLOB column. This limitation has been removed will be removed in SQL Server 2017.

 5. If you eventually decide that the data should be moved outside of SQL Server it may be easier to make that change if the data is already in a separate table.

  [1]: https://technet.microsoft.com/en-us/library/ms186981(v=sql.105).aspx

3 Answers