I've been playing around investigating sampling thresholds with statistics updates on SQL Server (2012) and noticed some curious behaviour. Basically the number of rows sampled seems to vary under some circumstances - even with the same set of data.
I run this query:
--Drop table if exists
IF (OBJECT_ID('dbo.Test')) IS NOT NULL DROP TABLE dbo.Test;
--Create Table for Testing
CREATE TABLE dbo.Test(Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED, TextValue VARCHAR(20) NULL);
--Insert enough data so we have more than 8Mb (the threshold at which sampling kicks in)
INSERT INTO dbo.Test(TextValue)
SELECT TOP 1000000 'blahblahblah'
FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d;
--Create Index on TextValue
CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue);
--Update Statistics without specifying how many rows to sample
UPDATE STATISTICS dbo.Test IX_Test_TextValue;
--View the Statistics
DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER;
When I look at the output of the SHOW_STATISTICS I'm finding that the "Rows Sampled" varies with each full execution (i.e. the table gets dropped, recreated and repopulated).
For example:
Rows Sampled
- 318618
- 319240
- 324198
- 314154
My expectation was that this figure would be the same each time as the table is identical. By the way, I don't get this behaviour if I just delete the data and re-insert it.
It's not a critical question, but I'd be interested in understanding what's going on.
## Background
Data for the statistics object are gathered using a statement of the form:
SELECT
StatMan([SC0], [SC1], [SB0000])
FROM
(
SELECT TOP 100 PERCENT
[SC0], [SC1], STEP_DIRECTION([SC0]) OVER (ORDER BY NULL) AS [SB0000]
FROM
(
SELECT
[TextValue] AS [SC0],
[Id] AS [SC1]
FROM [dbo].[Test]
TABLESAMPLE SYSTEM (2.223684e+001 PERCENT)
WITH (READUNCOMMITTED)
) AS _MS_UPDSTATS_TBL_HELPER
ORDER BY
[SC0],
[SC1],
[SB0000]
) AS _MS_UPDSTATS_TBL
OPTION (MAXDOP 1)
You can collect this statement with Extended Events or Profiler (`SP:StmtCompleted`).
Statistics generation queries often access the base table (rather than a nonclustered index) to avoid the clustering of values that naturally occurs on nonclustered index pages.
The number of rows sampled depends on the number of whole pages selected for sampling. Each page of the table is either selected or it is not. All rows on selected pages contribute to the statistics.
## Random numbers
SQL Server uses a random number generator to decide if a page qualifies or not. The generator used in this instance is the [Lehmer random number generator][1] with parameter values as shown below:
>**X~next~ = X~seed~ * 7^5^ mod (2^31^ - 1)**
The value of **X~seed~** is computed as the sum of:
* The low integer part of the (`bigint`) base table's `partition_id` e.g.
SELECT
P.[partition_id] & 0xFFFFFFFF
FROM sys.partitions AS P
WHERE
P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND P.index_id = 1;
* The value specified in the `REPEATABLE` clause
* For sampled `UPDATE STATISTICS`, the `REPEATABLE` value is 1.
* This value is exposed in the `m_randomSeed` element of the access method's internal debugging information shown in execution plans when trace flag 8666 is enabled, for example `<Field FieldName="m_randomSeed" FieldValue="1" />`
For SQL Server 2012, this calculation occurs in `sqlmin!UnOrderPageScanner::StartScan`:
``` none
mov edx,dword ptr [rcx+30h]
add edx,dword ptr [rcx+2Ch]
```
where memory at `[rcx+30h]` contains the low 32 bits of the partition id and memory at `[rcx+2Ch]` contains the `REPEATABLE` value in use.
The random number generator is initialized later in the same method, calling `sqlmin!RandomNumGenerator::Init`, where the instruction:
``` none
imul r9d,r9d,41A7h
```
...multiplies the seed by `41A7` hex (16807 decimal = 7^5^) as shown in the equation above.
Later random numbers (for individual pages) are generated using the same basic code inlined into `sqlmin!UnOrderPageScanner::SetupSubScanner`.
## StatMan
For the example `StatMan` query shown above, the same pages will be collected as for the T-SQL statement:
SELECT
COUNT_BIG(*)
FROM dbo.Test AS T
TABLESAMPLE SYSTEM (2.223684e+001 PERCENT) -- Same sample %
REPEATABLE (1) -- Always 1 for statman
WITH (INDEX(0)); -- Scan base object
This will match the output of:
SELECT
DDSP.rows_sampled
FROM sys.stats AS S
CROSS APPLY sys.dm_db_stats_properties(S.[object_id], S.stats_id) AS DDSP
WHERE
S.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND S.[name] = N'IX_Test_TextValue';
## Edge case
One consequence of using the MINSTD Lehmer random number generator is that seed values zero and int.max should not be used as this will result in the algorithm producing a sequence of zeroes (selecting every page).
The code detects zero, and uses a value from the system 'clock' as the seed in that case. It does not do the same if the seed is int.max (`0x7FFFFFFF` = 2^31^ - 1).
We can engineer this scenario since the initial seed is calculated as the sum of the low 32 bits of the partition id and the `REPEATABLE` value. The `REPEATABLE` value that will result in the seed being int.max and therefore every page being selected for sample is:
SELECT
0x7FFFFFFF - (P.[partition_id] & 0xFFFFFFFF)
FROM sys.partitions AS P
WHERE
P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND P.index_id = 1;
Working that into a complete example:
DECLARE @SQL nvarchar(4000) =
N'
SELECT
COUNT_BIG(*)
FROM dbo.Test AS T
TABLESAMPLE (0 PERCENT)
REPEATABLE (' +
(
SELECT TOP (1)
CONVERT(nvarchar(11), 0x7FFFFFFF - P.[partition_id] & 0xFFFFFFFF)
FROM sys.partitions AS P
WHERE
P.[object_id] = OBJECT_ID(N'dbo.Test', N'U')
AND P.index_id = 1
) + ')
WITH (INDEX(0));';
PRINT @SQL;
--EXECUTE (@SQL);
That will select every row on every page whatever the `TABLESAMPLE` clause says (even zero percent).
[1]: https://en.wikipedia.org/wiki/Lehmer_random_number_generator