or
sql-server sql-server-2012
mark sinkinson imported from SE
I'm trying to understand how Statistics sampling works and whether or not the below is expected behaviour on sampled statistics updates.

We have a large table partitioned by date with a couple of billion rows. The partition date is the prior business date and so is an ascending key. We only load data into this table for the prior day.

The data load runs overnight, so on Friday 8th April we loaded data for the 7th.

After each run we update statistics, although take a sample, rather than a `FULLSCAN`.

Maybe I am being naïve, but I would have expected SQL Server identify the highest key and lowest key in the range to ensure it got an accurate range sample. According to [this article][1]:  

> For the first bucket, the lower boundary is the smallest value of the column on which the histogram is generated.

However, it doesn't mention the last bucket/largest value. 

With the sampled Statistics update on the morning of the 8th, the sample missed the highest value in the table (the 7th).

[![enter image description here][2]][2]

As we do a lot of querying on data from the prior day, this resulted in inaccurate cardinality estimation and a number of queries timing out.

Should SQL Server not identify the highest value for that key and use that as the maximum `RANGE_HI_KEY`? Or is this just one of the limits of update without using `FULLSCAN`?

Version SQL Server 2012 SP2-CU7. We cannot currently upgrade due to a change in `OPENQUERY` behaviour in SP3 that was rounding down numbers in a linked server query between SQL Server and Oracle.


  [1]: https://www.sqlservercentral.com/blogs/practicalsqldba/2013/06/27/sql-server-part-2-all-about-sql-server-statistics-histogram/
  [2]: https://i.stack.imgur.com/hdKC9.png
Top Answer
Paul White
>Should SQL Server not identify the highest value for that key and use that as the maximum `RANGE_HI_KEY`? Or is this just one of the limits of update without using `FULLSCAN`?

It's a limitation of the current implementation of sampled statistics. As it stands, sampled statistics collection uses `TABLESAMPLE SYSTEM`, which uses an allocation-order scan and chooses pages from the scan to sample. Only chosen pages contribute to the histogram.

Since the scan is allocation-ordered (rather than index-ordered), there is no way to give preference to the first and last pages in key order.

For more information see this related question:

https://dba.stackexchange.com/q/34883/1192

and my article, [Allocation Order Scans][1]

For workarounds, see [Statistics on Ascending Columns][2] by [Fabiano Amorim][3]


  [1]: http://sqlperformance.com/2015/01/t-sql-queries/allocation-order-scans
  [2]: https://www.simple-talk.com/sql/database-administration/statistics-on-ascending-columns/
  [3]: https://www.simple-talk.com/author/fabiano-amorim/
SQL Server sample Update of Statistics misses highest RANGE_HI_KEY on ascending key column

This is a dedicated room for discussion about this question.

Once logged in you can direct comments to the question poster (or any answer poster) here.