sql-server
Daniel Miller (imported from SE)
Our vendor changed the column widths on almost every column in the entire database. The database is around 7 TB, with 9,000+ tables. We are trying to create an index on a table that has 5.5 billion rows. Before the vendor's upgrade we could create the index in 2 hours; now it takes days. What they have done is increase every varchar(xx) column to varchar(256), so columns that used to be varchar(18), varchar(75), etc. are now all varchar(256).

Anyway, the primary key consists of 6 columns whose combined width was 126 characters. After the upgrade, the primary key is 1,283 characters, which violates SQL Server's 900-byte index key limit. The table's total combined varchar width went from 1,049 to 4,009.
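
To illustrate the key-width problem (hypothetical table and column names; the vendor's actual schema differs, and not every key column went to 256):

    -- Widening the key columns pushes the maximum possible key size past 900 bytes.
    -- SQL Server still creates the table but raises a warning, and any insert whose
    -- actual key exceeds 900 bytes will fail.
    CREATE TABLE dbo.WideKeyDemo
    (
        k1 varchar(256) NOT NULL,
        k2 varchar(256) NOT NULL,
        k3 varchar(256) NOT NULL,
        k4 varchar(256) NOT NULL,
        k5 varchar(256) NOT NULL,
        CONSTRAINT PK_WideKeyDemo
            PRIMARY KEY CLUSTERED (k1, k2, k3, k4, k5)  -- maximum key size 1,280 bytes > 900
    );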

There is no increase in data; the table doesn't take up any more "space" than it did before the column width increases. But creating something as simple as an index now takes an unreasonable amount of time.

Can anyone explain why it is taking so much longer to create an index when the only thing that changed is the size of the columns?

The index we are trying to create is nonclustered since the pk is the clustered index.  After several attempts to create the index, we gave up. I think it ran 4 or 5 days without completion.

I tried this in a non-production environment by taking a file-system snapshot and bringing the database up on a quieter server.
Top Answer
Geoff Patterson (imported from SE)
Remus has helpfully pointed out that the declared max length of the `VARCHAR` columns impacts the estimated row size and therefore the memory grant that SQL Server provides.

I tried to do a bit more research to expand on the "from this on things cascade" part of his answer. I don't have a complete or concise explanation, but here is what I found.

**Repro script**

[I created a full script][1] that generates a fake data set on which index creation takes roughly 10x as long on my machine for the `VARCHAR(256)` version. The data used is exactly the same, but the first table uses the actual max lengths of `18`, `75`, `9`, `15`, `123`, and `5`, while all columns use a max length of `256` in the second table.
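
The gist has the full data generation; structurally, the two tables are shaped roughly like this (a sketch only; column names `s5`/`s6`, the `NOT NULL` constraints, and the copy step are my shorthand, not the exact script):

    -- Table with realistic declared widths (matching the actual max data lengths)
    CREATE TABLE dbo.testVarchar
    (
        s1 varchar(18)  NOT NULL,
        s2 varchar(75)  NOT NULL,
        s3 varchar(9)   NOT NULL,
        s4 varchar(15)  NOT NULL,
        s5 varchar(123) NOT NULL,
        s6 varchar(5)   NOT NULL
    );

    -- Same columns, every one declared varchar(256)
    CREATE TABLE dbo.testVarchar256
    (
        s1 varchar(256) NOT NULL,
        s2 varchar(256) NOT NULL,
        s3 varchar(256) NOT NULL,
        s4 varchar(256) NOT NULL,
        s5 varchar(256) NOT NULL,
        s6 varchar(256) NOT NULL
    );

    -- The second table holds exactly the same data as the first
    INSERT INTO dbo.testVarchar256 (s1, s2, s3, s4, s5, s6)
    SELECT s1, s2, s3, s4, s5, s6
    FROM dbo.testVarchar;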

**Keying the original table**

Here we see that the original query completes in about 20 seconds and the logical reads correspond to the table size of `~1.5GB` (195K pages at 8 KB per page).

	-- CPU time = 37674 ms,  elapsed time = 19206 ms.
	-- Table 'testVarchar'. Scan count 9, logical reads 194490, physical reads 0
	CREATE CLUSTERED INDEX IX_testVarchar
	ON dbo.testVarchar (s1, s2, s3, s4)
	WITH (MAXDOP = 8) -- Same as my global MAXDOP, but just being explicit
	GO
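
(The timing and I/O figures in the comments above are the standard session statistics output; something like the following was enabled before running each `CREATE INDEX`.)

    SET STATISTICS TIME ON;  -- prints CPU time / elapsed time per statement
    SET STATISTICS IO ON;    -- prints scan count and logical/physical reads per table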

**Keying the VARCHAR(256) table**

For the `VARCHAR(256)` table, we see that the elapsed time has increased dramatically. 

Interestingly, neither the CPU time nor the logical reads increase. This makes sense given that the table has the exact same data, but it doesn't explain why the elapsed time is so much slower.

	-- CPU time = 33212 ms,  elapsed time = 263134 ms.
	-- Table 'testVarchar256'. Scan count 9, logical reads 194491
	CREATE CLUSTERED INDEX IX_testVarchar256
	ON dbo.testVarchar256 (s1, s2, s3, s4)
	WITH (MAXDOP = 8) -- Same as my global MAXDOP, but just being explicit
	GO

**I/O and wait stats: original**

If we capture a bit more detail (using [p_perfMon, a procedure that I wrote][2]), we can see that the vast majority of the I/O is performed on the `LOG` file. We see a relatively modest amount of I/O on the actual `ROWS` (the main data file), and the primary wait type is `LATCH_EX`, indicating in-memory page contention.

We can also see that my spinning disk is somewhere between "bad" and "shockingly bad", [according to Paul Randal][3] :)

[![enter image description here][4]][4]
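
p_perfMon is just a convenience wrapper; similar before/after snapshots can be taken straight from the DMVs (a minimal sketch; the linked procedure also diffs the snapshots for you):

    -- Per-file I/O, including stall times, split by file type (ROWS vs. LOG)
    SELECT DB_NAME(vfs.database_id) AS database_name,
           mf.type_desc,
           vfs.num_of_reads, vfs.io_stall_read_ms,
           vfs.num_of_writes, vfs.io_stall_write_ms
    FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN sys.master_files AS mf
        ON mf.database_id = vfs.database_id
       AND mf.file_id = vfs.file_id;

    -- Cumulative wait stats; capture before and after the index build and diff
    SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
    FROM sys.dm_os_wait_stats
    ORDER BY wait_time_ms DESC;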

**I/O and wait stats: VARCHAR(256)**

For the `VARCHAR(256)` version, the I/O and wait stats look completely different! Here we see a huge increase in the I/O on the data file (`ROWS`), and the stall times now make Paul Randal simply say "WOW!".

It's not surprising that the #1 wait type is now `IO_COMPLETION`. But why is so much I/O generated?

[![enter image description here][5]][5]

**Actual query plan: VARCHAR(256)**

From the query plan, we can see that the `Sort` operator has a recursive spill (5 levels deep!) in the `VARCHAR(256)` version of the query. (There is no spill at all in the original version.)

[![enter image description here][6]][6]
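
If you save the actual execution plan, the spill is recorded in the plan XML as a `SpillToTempDb` warning on the `Sort` operator, and its `SpillLevel` attribute records how many levels deep the spill went. One way to pull it out (with `@plan_xml` as a placeholder for the plan you captured):

    DECLARE @plan_xml xml;  -- load the saved actual plan XML into this variable

    WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
    SELECT w.spill.value('@SpillLevel', 'int') AS spill_level
    FROM @plan_xml.nodes('//Warnings/SpillToTempDb') AS w(spill);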

**Live query progress: VARCHAR(256)**

We can [use sys.dm_exec_query_profiles to view live query progress in SQL 2014+][7]. In the original version, the entire `Table Scan` and `Sort` are processed without any spills (`spill_page_count` remains `0` throughout).

In the `VARCHAR(256)` version, however, we can see that page spills quickly accumulate for the `Sort` operator. Here is a snapshot of the query progress just before the query completes. The data here is aggregated across all threads.

[![enter image description here][8]][8]
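
A snapshot like the one above can be taken from a second connection polling the DMV while the `CREATE INDEX` runs; a query along these lines does the aggregation (the session_id is an example, and the exact column list of sys.dm_exec_query_profiles varies a little between builds):

    -- Live, per-operator progress for the session running the index build,
    -- summed across all parallel threads
    SELECT qp.node_id,
           qp.physical_operator_name,
           SUM(qp.row_count)        AS row_count,
           SUM(qp.spill_page_count) AS spill_page_count  -- pages spilled to tempdb
    FROM sys.dm_exec_query_profiles AS qp
    WHERE qp.session_id = 54  -- example spid of the CREATE INDEX session
    GROUP BY qp.node_id, qp.physical_operator_name
    ORDER BY qp.node_id;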

If I dig into each thread individually, I see that 2 threads complete the sort within about 5 seconds (at roughly 20 seconds overall, after 15 seconds spent on the table scan). If all threads progressed at this rate, the `VARCHAR(256)` index creation would have completed in roughly the same time as the original table.

However, the remaining 6 threads progress at a much slower rate. This may be due to the way memory is allocated and the way the threads are held up by I/O as they spill data, but I don't know for sure.

[![enter image description here][9]][9]

**What can you do?**

There are a number of things you might consider trying:

* Work with the vendor to roll back to a previous version. If that's not possible, let the vendor know that you are not happy with this change so that they can consider reverting it in a future release.
* When adding your index, consider using `WITH (MAXDOP = X)` where `X` is a lower number than your current server-level setting (see the sketch after this list). When I used `MAXDOP = 2` on this specific data set on my machine, the `VARCHAR(256)` version completed in `25 seconds` (compared to 3-4 minutes with 8 threads!). It's possible that the spilling behavior is exacerbated by higher parallelism.
* If additional hardware investment is a possibility, profile the I/O (the likely bottleneck) on your system and consider using an SSD to reduce the latency of the I/O incurred by spills.
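
As a concrete example of the MAXDOP suggestion, applied to the repro table above (the DOP value and the `DROP_EXISTING` option are illustrative; the hint caps only this statement and leaves the server-level setting alone):

    -- Rebuild the VARCHAR(256) index with reduced parallelism for this statement only
    CREATE CLUSTERED INDEX IX_testVarchar256
    ON dbo.testVarchar256 (s1, s2, s3, s4)
    WITH (DROP_EXISTING = ON, MAXDOP = 2);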

**Further reading**

Paul White has a nice blog post on [the internals of SQL Server sorts][10] that may be of interest. It does talk a little bit about spilling, thread skew, and memory allocation for parallel sorts.


  [1]: https://gist.github.com/anonymous/71ee72101c806441676f
  [2]: https://gist.github.com/anonymous/ec12170fbe9b76224fae
  [3]: http://www.sqlskills.com/blogs/paul/are-io-latencies-killing-your-performance/
  [4]: https://i.stack.imgur.com/hIhBS.png
  [5]: https://i.stack.imgur.com/UJ2v8.png
  [6]: https://i.stack.imgur.com/bASQk.png
  [7]: https://dba.stackexchange.com/questions/113124/query-is-slow-in-sql-server-2014-fast-in-sql-server-2012/113126#113126
  [8]: https://i.stack.imgur.com/GG3yq.png
  [9]: https://i.stack.imgur.com/QkLu7.png
  [10]: http://sqlperformance.com/2015/04/sql-plan/internals-of-the-seven-sql-server-sorts-part-1
Answer #2
Remus Rusanu (imported from SE)
The intermediate sort table will be estimated differently between the two cases. This leads to different memory grant requests (the `VARCHAR(256)` request will be bigger) and likely a much smaller actual grant, percentage-wise, compared with the 'ideal' request. I guess this leads to spills during the sort.

Testing the script from Geoff (on 100k rows only), I can clearly see a difference in the sort's estimated row size (141 bytes vs. 789 bytes). From this on things cascade.
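
One way to see the grant gap directly (not part of Geoff's script, just a sketch): while the `CREATE INDEX` is running, compare the requested, granted, and ideal memory for that session.

    -- Memory grant details for active memory-consuming requests
    -- (run from a separate connection while the index build is executing)
    SELECT session_id,
           dop,
           requested_memory_kb,
           granted_memory_kb,
           ideal_memory_kb,     -- what the estimate would have liked, given the declared row size
           used_memory_kb,
           max_used_memory_kb
    FROM sys.dm_exec_query_memory_grants
    WHERE session_id <> @@SPID;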
