I've been playing around investigating sampling thresholds with statistics updates on SQL Server (2012) and noticed some curious behaviour. Basically the number of rows sampled seems to vary under some circumstances - even with the same set of data. I run this query: --Drop table if exists IF (OBJECT_ID('dbo.Test')) IS NOT NULL DROP TABLE dbo.Test; --Create Table for Testing CREATE TABLE dbo.Test(Id INT IDENTITY(1,1) CONSTRAINT PK_Test PRIMARY KEY CLUSTERED, TextValue VARCHAR(20) NULL); --Insert enough data so we have more than 8Mb (the threshold at which sampling kicks in) INSERT INTO dbo.Test(TextValue) SELECT TOP 1000000 'blahblahblah' FROM sys.objects a, sys.objects b, sys.objects c, sys.objects d; --Create Index on TextValue CREATE INDEX IX_Test_TextValue ON dbo.Test(TextValue); --Update Statistics without specifying how many rows to sample UPDATE STATISTICS dbo.Test IX_Test_TextValue; --View the Statistics DBCC SHOW_STATISTICS('dbo.Test', IX_Test_TextValue) WITH STAT_HEADER; When I look at the output of the SHOW_STATISTICS I'm finding that the "Rows Sampled" varies with each full execution (i.e. the table gets dropped, recreated and repopulated). For example: Rows Sampled - 318618 - 319240 - 324198 - 314154 My expectation was that this figure would be the same each time as the table is identical. By the way, I don't get this behaviour if I just delete the data and re-insert it. It's not a critical question, but I'd be interested in understanding what's going on.
Paul White (imported from SE)
## Background Data for the statistics object are gathered using a statement of the form: SELECT StatMan([SC0], [SC1], [SB0000]) FROM ( SELECT TOP 100 PERCENT [SC0], [SC1], STEP_DIRECTION([SC0]) OVER (ORDER BY NULL) AS [SB0000] FROM ( SELECT [TextValue] AS [SC0], [Id] AS [SC1] FROM [dbo].[Test] TABLESAMPLE SYSTEM (2.223684e+001 PERCENT) WITH (READUNCOMMITTED) ) AS _MS_UPDSTATS_TBL_HELPER ORDER BY [SC0], [SC1], [SB0000] ) AS _MS_UPDSTATS_TBL OPTION (MAXDOP 1) You can collect this statement with Extended Events or Profiler (`SP:StmtCompleted`). Statistics generation queries often access the base table (rather than a nonclustered index) to avoid the clustering of values that naturally occurs on nonclustered index pages. The number of rows sampled depends on the number of whole pages selected for sampling. Each page of the table is either selected or it is not. All rows on selected pages contribute to the statistics. ## Random numbers SQL Server uses a random number generator to decide if a page qualifies or not. The generator used in this instance is the [Lehmer random number generator] with parameter values as shown below: >**X~next~ = X~seed~ * 7^5^ mod (2^31^ - 1)** The value of **X~seed~** is computed as the sum of: * The low integer part of the (`bigint`) base table's `partition_id` e.g. SELECT P.[partition_id] & 0xFFFFFFFF FROM sys.partitions AS P WHERE P.[object_id] = OBJECT_ID(N'dbo.Test', N'U') AND P.index_id = 1; * The value specified in the `REPEATABLE` clause * For sampled `UPDATE STATISTICS`, the `REPEATABLE` value is 1. * This value is exposed in the `m_randomSeed` element of the access method's internal debugging information shown in execution plans when trace flag 8666 is enabled, for example `<Field FieldName="m_randomSeed" FieldValue="1" />` For SQL Server 2012, this calculation occurs in `sqlmin!UnOrderPageScanner::StartScan`: ``` none mov edx,dword ptr [rcx+30h] add edx,dword ptr [rcx+2Ch] ``` where memory at `[rcx+30h]` contains the low 32 bits of the partition id and memory at `[rcx+2Ch]` contains the `REPEATABLE` value in use. The random number generator is initialized later in the same method, calling `sqlmin!RandomNumGenerator::Init`, where the instruction: ``` none imul r9d,r9d,41A7h ``` ...multiplies the seed by `41A7` hex (16807 decimal = 7^5^) as shown in the equation above. Later random numbers (for individual pages) are generated using the same basic code inlined into `sqlmin!UnOrderPageScanner::SetupSubScanner`. ## StatMan For the example `StatMan` query shown above, the same pages will be collected as for the T-SQL statement: SELECT COUNT_BIG(*) FROM dbo.Test AS T TABLESAMPLE SYSTEM (2.223684e+001 PERCENT) -- Same sample % REPEATABLE (1) -- Always 1 for statman WITH (INDEX(0)); -- Scan base object This will match the output of: SELECT DDSP.rows_sampled FROM sys.stats AS S CROSS APPLY sys.dm_db_stats_properties(S.[object_id], S.stats_id) AS DDSP WHERE S.[object_id] = OBJECT_ID(N'dbo.Test', N'U') AND S.[name] = N'IX_Test_TextValue'; ## Edge case One consequence of using the MINSTD Lehmer random number generator is that seed values zero and int.max should not be used as this will result in the algorithm producing a sequence of zeroes (selecting every page). The code detects zero, and uses a value from the system 'clock' as the seed in that case. It does not do the same if the seed is int.max (`0x7FFFFFFF` = 2^31^ - 1). We can engineer this scenario since the initial seed is calculated as the sum of the low 32 bits of the partition id and the `REPEATABLE` value. The `REPEATABLE` value that will result in the seed being int.max and therefore every page being selected for sample is: SELECT 0x7FFFFFFF - (P.[partition_id] & 0xFFFFFFFF) FROM sys.partitions AS P WHERE P.[object_id] = OBJECT_ID(N'dbo.Test', N'U') AND P.index_id = 1; Working that into a complete example: DECLARE @SQL nvarchar(4000) = N' SELECT COUNT_BIG(*) FROM dbo.Test AS T TABLESAMPLE (0 PERCENT) REPEATABLE (' + ( SELECT TOP (1) CONVERT(nvarchar(11), 0x7FFFFFFF - P.[partition_id] & 0xFFFFFFFF) FROM sys.partitions AS P WHERE P.[object_id] = OBJECT_ID(N'dbo.Test', N'U') AND P.index_id = 1 ) + ') WITH (INDEX(0));'; PRINT @SQL; --EXECUTE (@SQL); That will select every row on every page whatever the `TABLESAMPLE` clause says (even zero percent). : https://en.wikipedia.org/wiki/Lehmer_random_number_generator