I was working on a demo involving clustered columnstore indexes (CCIs) when I noticed that some of my inserts were taking longer than expected. Table definitions to reproduce:
-- Staging table: 1048576 sequential BIGINT values (1 to 1048576)
DROP TABLE IF EXISTS dbo.STG_1048576;
CREATE TABLE dbo.STG_1048576 (ID BIGINT NOT NULL);
INSERT INTO dbo.STG_1048576
SELECT TOP (1048576) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2;
-- Target table: single BIGINT column with a clustered columnstore index
DROP TABLE IF EXISTS dbo.CCI_BIGINT;
CREATE TABLE dbo.CCI_BIGINT (ID BIGINT NOT NULL, INDEX CCI CLUSTERED COLUMNSTORE);
For the tests I'm inserting all 1048576 rows from the staging table. That's enough to fill exactly one compressed rowgroup as long as it doesn't get trimmed for some reason.
If I insert all of the integers mod 17000 it takes less than a second:
TRUNCATE TABLE dbo.CCI_BIGINT;
INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT ID % 17000
FROM dbo.STG_1048576
OPTION (MAXDOP 1);
> SQL Server Execution Times: CPU time = 359 ms, elapsed time = 364 ms.
However, if I insert the same integers mod 16000 it sometimes takes over 30 seconds:
TRUNCATE TABLE dbo.CCI_BIGINT;
INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT ID % 16000
FROM dbo.STG_1048576
OPTION (MAXDOP 1);
> SQL Server Execution Times: CPU time = 32062 ms, elapsed time = 32511 ms.
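After either insert you can confirm that all 1048576 rows landed in a single COMPRESSED rowgroup, and compare the compressed sizes of the two cases, with a quick sanity check like this:
SELECT state_desc, total_rows, size_in_bytes, trim_reason_desc
FROM sys.dm_db_column_store_row_group_physical_stats
WHERE object_id = OBJECT_ID('dbo.CCI_BIGINT');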
The test is repeatable and has been reproduced on multiple machines. There is a clear pattern in elapsed time as the mod value changes:
MOD_NUM TIME_IN_MS
1000 2036
2000 3857
3000 5463
4000 6930
5000 8414
6000 10270
7000 12350
8000 13936
9000 17470
10000 19946
11000 21373
12000 24950
13000 28677
14000 31030
15000 34040
16000 37000
17000 563
18000 583
19000 576
20000 584
If you want to run tests yourself feel free to modify the test code that I wrote [here][1].
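For reference, the shape of the test is roughly the following; this is a minimal sketch of the loop, not the linked script itself:
-- Minimal sketch: time the serial TABLOCK insert for each mod value.
DECLARE @mod BIGINT = 1000, @start DATETIME2(3);

WHILE @mod <= 20000
BEGIN
    TRUNCATE TABLE dbo.CCI_BIGINT;
    SET @start = SYSUTCDATETIME();

    INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
    SELECT ID % @mod
    FROM dbo.STG_1048576
    OPTION (MAXDOP 1);

    SELECT @mod AS MOD_NUM, DATEDIFF(MILLISECOND, @start, SYSUTCDATETIME()) AS TIME_IN_MS;
    SET @mod += 1000;
END;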
I couldn't find anything interesting in sys.dm_os_wait_stats for the mod 16000 insert:
╔════════════════════════════════════╦══════════════╗
║ wait_type ║ diff_wait_ms ║
╠════════════════════════════════════╬══════════════╣
║ XE_DISPATCHER_WAIT ║ 164406 ║
║ QDS_PERSIST_TASK_MAIN_LOOP_SLEEP ║ 120002 ║
║ LAZYWRITER_SLEEP ║ 97718 ║
║ LOGMGR_QUEUE ║ 97298 ║
║ DIRTY_PAGE_POLL ║ 97254 ║
║ HADR_FILESTREAM_IOMGR_IOCOMPLETION ║ 97111 ║
║ SQLTRACE_INCREMENTAL_FLUSH_SLEEP ║ 96008 ║
║ REQUEST_FOR_DEADLOCK_SEARCH ║ 95001 ║
║ XE_TIMER_EVENT ║ 94689 ║
║ SLEEP_TASK ║ 48308 ║
║ BROKER_TO_FLUSH ║ 48264 ║
║ CHECKPOINT_QUEUE ║ 35589 ║
║ SOS_SCHEDULER_YIELD ║ 13 ║
╚════════════════════════════════════╩══════════════╝
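For completeness, the diff_wait_ms values above are just a before/after delta of sys.dm_os_wait_stats taken around the insert, along these lines:
-- Snapshot-and-subtract pattern used to produce the wait diff above (sketch).
SELECT wait_type, wait_time_ms
INTO #before
FROM sys.dm_os_wait_stats;

-- ... run the mod 16000 insert here ...

SELECT w.wait_type, w.wait_time_ms - b.wait_time_ms AS diff_wait_ms
FROM sys.dm_os_wait_stats AS w
JOIN #before AS b ON b.wait_type = w.wait_type
WHERE w.wait_time_ms - b.wait_time_ms > 0
ORDER BY diff_wait_ms DESC;

DROP TABLE #before;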
Why does the insert for `ID % 16000` take so much longer than the insert for `ID % 17000`?
[1]: https://pastebin.com/qXPhYFfE
In many respects, this is expected behaviour. Any set of compression routines will have widely ranging performance depending on input data distribution. We expect to trade data loading speed for storage size and runtime querying performance.
There is a definite limit to how detailed an answer you're going to get here, since VertiPaq is a proprietary implementation, and the details are a closely-guarded secret. Even so, we do know that VertiPaq contains routines for:
* Value encoding (scaling and/or translating values to fit in a small number of bits)
* Dictionary encoding (integer references to unique values)
* Run Length Encoding (storing runs of repeated values as [value, count] pairs)
* Bit-packing (storing the stream in as few bits as possible)
Typically, data is first value or dictionary encoded, then RLE or bit-packing is applied (or a hybrid of RLE and bit-packing is used on different subsections of the segment data). Deciding which techniques to apply can involve generating a histogram of the data to help determine where the maximum bit savings can be achieved.
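The top-level encoding choice for each segment, along with its compressed size, is visible in the documented catalog view sys.column_store_segments, so you can compare the two test cases directly; the RLE versus bit-packing split inside a segment is not exposed there (that is where DBCC CSINDEX, mentioned below, comes in):
-- Encoding type and on-disk size of each compressed segment in the test table.
-- encoding_type distinguishes value-based from dictionary-based encodings;
-- see the documentation for the exact codes.
SELECT s.column_id, s.segment_id, s.encoding_type, s.row_count, s.on_disk_size
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p ON p.partition_id = s.partition_id
WHERE p.object_id = OBJECT_ID('dbo.CCI_BIGINT');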
Capturing the slow case with Windows Performance Recorder and analyzing the result with Windows Performance Analyzer, we can see that the vast majority of the execution time is consumed in looking at the clustering of the data, building histograms, and deciding how to partition it for best savings:
[![WPA Analysis][1]][1]
The most expensive processing occurs for values that appear at least 64 times in the segment. This is a heuristic to determine when *pure* RLE is likely to be beneficial. The faster cases result in *impure* storage e.g. a bit-packed representation, with a larger final storage size. In the hybrid cases, values with 64 or more repetitions are RLE encoded, and the remainder are bit-packed.
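That threshold is what separates the two cases in the question: with ID % 16000, each of the 16,000 distinct values appears 65 or 66 times (1,048,576 / 16,000 ≈ 65.5), so every value qualifies for the expensive RLE evaluation; with ID % 17000, each value appears only 61 or 62 times and none of them do. A quick way to check the repetition counts for any modulus against the question's staging table:
-- For ID % 16000 every value repeats 65 or 66 times (all >= 64);
-- for ID % 17000 every value repeats 61 or 62 times (none reach 64).
SELECT
    COUNT(*) AS distinct_values,
    MIN(cnt) AS min_repetitions,
    MAX(cnt) AS max_repetitions,
    SUM(CASE WHEN cnt >= 64 THEN 1 ELSE 0 END) AS values_with_64_plus
FROM
(
    SELECT ID % 16000 AS val, COUNT(*) AS cnt
    FROM dbo.STG_1048576
    GROUP BY ID % 16000
) AS d;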
The longest duration occurs when the maximum number of distinct values with 64 repetitions appear in the largest possible segment i.e. 1,048,576 rows with 16,384 sets of values with 64 entries each. Inspection of the code reveals a hard-coded time limit for the expensive processing. This can be configured in other VertiPaq implementations e.g. SSAS, but not in SQL Server as far as I can tell.
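If you want to probe that worst case directly against the question's tables, the corresponding modulus is 16,384 (1,048,576 / 16,384 = 64 exactly); I would expect it to fall on the expensive path as well, though I have not included a timing for it here:
-- Theoretical worst case for the heuristic: 16,384 distinct values,
-- each repeated exactly 64 times.
TRUNCATE TABLE dbo.CCI_BIGINT;

INSERT INTO dbo.CCI_BIGINT WITH (TABLOCK)
SELECT ID % 16384
FROM dbo.STG_1048576
OPTION (MAXDOP 1);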
Some insight into the final storage arrangement can be acquired using the [undocumented `DBCC CSINDEX` command][2]. This shows the RLE header and array entries, any bookmarks into the RLE data, and a brief summary of the bit-pack data (if any).
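A minimal sketch of driving it follows. The parameter order is taken from the linked write-up; since the command is undocumented it may differ between builds, and the rowset id is pulled here from sys.partitions, which may or may not be the rowset your build expects (again, see the linked post), so treat this as an assumption rather than a reference:
-- Gather the identifiers DBCC CSINDEX needs (single-column, single-rowgroup
-- table, so TOP (1) without ORDER BY is good enough for this sketch).
DECLARE @dbname sysname = DB_NAME();
DECLARE @rowsetid BIGINT, @columnid INT, @rowgroupid INT;

SELECT TOP (1)
    @rowsetid = p.hobt_id,
    @columnid = s.column_id,
    @rowgroupid = s.segment_id
FROM sys.column_store_segments AS s
JOIN sys.partitions AS p ON p.partition_id = s.partition_id
WHERE p.object_id = OBJECT_ID('dbo.CCI_BIGINT');

-- Undocumented command; argument order per the linked post:
-- database, rowset id, column id, rowgroup id, display option.
DBCC CSINDEX (@dbname, @rowsetid, @columnid, @rowgroupid, 1);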
For more information, see:
* [The VertiPaq Engine in DAX][3] by Alberto Ferrari and Marco Russo
* [Microsoft Patent WO2015038442][4]: Processing datasets with a DBMS engine
* [Microsoft Patent WO2010039898][5]: Efficient large-scale filtering and/or sorting for querying of column based data encoded structures
[1]: https://i.stack.imgur.com/AKQ5I.png
[2]: http://www.queryprocessor.ru/dbcc-csindex/
[3]: https://www.microsoftpressstore.com/articles/article.aspx?p=2449192&seqNum=3
[4]: https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2015038442
[5]: https://patentscope.wipo.int/search/en/detail.jsf?docId=WO2010039898