Strange behaviour with sample sizes for statistics updates - Databases

TopAnswers Databases

Meta

Databases

TeX

Code Golf

APL

C++

.net

db<>fiddle

Java

*nix

PHP

PowerShell

Python

Rust

टेक्-मराठी

Typst

Web Client Dev

Web Server Dev

Strange behaviour with sample sizes for statistics updates

sql-server sql-server-2012 add tag

matthew mcgiffen (imported from SE)

I've been playing around investigating sampling thresholds with statistics updates on SQL Server (2012) and noticed some curious behaviour. Basically the number of rows sampled seems to vary under some circumstances - even with the same set of data. 

I run this query:

    --Drop table if exists
    IF (OBJECT_ID('dbo.Test')) IS NOT NULL DROP TABLE dbo.Test;
    
    --Create Table for Testing
    CREATE TABLE dbo.Test(Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED, TextValue VARCHAR(20) NULL);

    --Insert enough data so we have more than 8Mb (the threshold at which sampling kicks in)
    INSERT INTO dbo.Test(TextValue) 
    SELECT TOP 1000000 'blahblahblah'
    FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d;  

    --Create Index on TextValue
    CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue);

    --Update Statistics without specifying how many rows to sample
    UPDATE STATISTICS dbo.Test IX_Test_TextValue;

    --View the Statistics
    DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER;


When I look at the output of the SHOW_STATISTICS I'm finding that the "Rows Sampled" varies with each full execution (i.e. the table gets dropped, recreated and repopulated). 

For example:

Rows Sampled

 - 318618 
 - 319240 
 - 324198 
 - 314154

My expectation was that this figure would be the same each time as the table is identical. By the way, I don't get this behaviour if I just delete the data and re-insert it. 
 
It's not a critical question, but I'd be interested in understanding what's going on.

Top Answer

Paul White (imported from SE)

## Background

Data for the statistics object are gathered using a statement of the form:

    SELECT 
        StatMan([SC0], [SC1], [SB0000]) 
    FROM 
    (
        SELECT TOP 100 PERCENT 
            [SC0], [SC1], STEP_DIRECTION([SC0]) OVER (ORDER BY NULL) AS [SB0000]
        FROM 
        (
            SELECT 
                [TextValue] AS [SC0], 
                [Id] AS [SC1] 
            FROM [dbo].[Test] 
                TABLESAMPLE SYSTEM (2.223684e+001 PERCENT) 
                WITH (READUNCOMMITTED) 
        ) AS _MS_UPDSTATS_TBL_HELPER 
        ORDER BY 
            [SC0], 
            [SC1], 
            [SB0000] 
    ) AS _MS_UPDSTATS_TBL
    OPTION (MAXDOP 1)

You can collect this statement with Extended Events or Profiler (`SP:StmtCompleted`).

Statistics generation queries often access the base table (rather than a nonclustered index) to avoid the clustering of values that naturally occurs on nonclustered index pages.

The number of rows sampled depends on the number of whole pages selected for sampling. Each page of the table is either selected or it is not. All rows on selected pages contribute to the statistics.

## Random numbers

SQL Server uses a random number generator to decide if a page qualifies or not. The generator used in this instance is the [Lehmer random number generator][1] with parameter values as shown below:

>**X~next~ = X~seed~ * 7^5^ mod (2^31^ - 1)**

The value of **X~seed~** is computed as the sum of:

* The low integer part of the (`bigint`) base table's `partition_id` e.g.

        SELECT
            P.[partition_id] & 0xFFFFFFFF
        FROM sys.partitions AS P
        WHERE
            P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
            AND P.index_id = 1;

* The value specified in the `REPEATABLE` clause
 * For sampled `UPDATE STATISTICS`, the `REPEATABLE` value is 1.
 * This value is exposed in the `m_randomSeed` element of the access method's internal debugging information shown in execution plans when trace flag 8666 is enabled, for example `<Field FieldName="m_randomSeed" FieldValue="1" />`

For SQL Server 2012, this calculation occurs in `sqlmin!UnOrderPageScanner::StartScan`:

``` none
mov     edx,dword ptr [rcx+30h]
add     edx,dword ptr [rcx+2Ch]
```

where memory at `[rcx+30h]` contains the low 32 bits of the partition id and memory at `[rcx+2Ch]` contains the `REPEATABLE` value in use.

The random number generator is initialized later in the same method, calling `sqlmin!RandomNumGenerator::Init`, where the instruction:

``` none
imul    r9d,r9d,41A7h
```

...multiplies the seed by `41A7` hex (16807 decimal = 7^5^) as shown in the equation above.

Later random numbers (for individual pages) are generated using the same basic code inlined into `sqlmin!UnOrderPageScanner::SetupSubScanner`.

## StatMan

For the example `StatMan` query shown above, the same pages will be collected as for the T-SQL statement:

    SELECT 
        COUNT_BIG(*) 
    FROM dbo.Test AS T 
        TABLESAMPLE SYSTEM (2.223684e+001 PERCENT)  -- Same sample %
        REPEATABLE (1)                              -- Always 1 for statman
        WITH (INDEX(0));                            -- Scan base object

This will match the output of:

    SELECT 
        DDSP.rows_sampled
    FROM sys.stats AS S
    CROSS APPLY sys.dm_db_stats_properties(S.[object_id], S.stats_id) AS DDSP
    WHERE 
        S.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
        AND S.[name] = N'IX_Test_TextValue';

## Edge case

One consequence of using the MINSTD Lehmer random number generator is that seed values zero and int.max should not be used as this will result in the algorithm producing a sequence of zeroes (selecting every page).

The code detects zero, and uses a value from the system 'clock' as the seed in that case. It does not do the same if the seed is int.max (`0x7FFFFFFF` = 2^31^ - 1).

We can engineer this scenario since the initial seed is calculated as the sum of the low 32 bits of the partition id and the `REPEATABLE` value. The `REPEATABLE` value that will result in the seed being int.max and therefore every page being selected for sample is:

    SELECT
        0x7FFFFFFF - (P.[partition_id] & 0xFFFFFFFF)
    FROM sys.partitions AS P
    WHERE
        P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
        AND P.index_id = 1;

Working that into a complete example:

    DECLARE @SQL nvarchar(4000) = 
        N'
        SELECT
            COUNT_BIG(*) 
        FROM dbo.Test AS T 
            TABLESAMPLE (0 PERCENT) 
            REPEATABLE (' +
            (
                SELECT TOP (1)
                    CONVERT(nvarchar(11), 0x7FFFFFFF - P.[partition_id] & 0xFFFFFFFF)
                FROM sys.partitions AS P
                WHERE
                    P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
                    AND P.index_id = 1
            ) + ')
            WITH (INDEX(0));';
    
    PRINT @SQL;
    --EXECUTE (@SQL);

That will select every row on every page whatever the `TABLESAMPLE` clause says (even zero percent).

  [1]: https://en.wikipedia.org/wiki/Lehmer_random_number_generator

1 Answer