I'm trying to understand how Statistics sampling works and whether or not the below is expected behaviour on sampled statistics updates.

We have a large table partitioned by date with a couple of billion rows. The partition date is the prior business date and so is an ascending key. We only load data into this table for the prior day.

The data load runs overnight, so on Friday 8th April we loaded data for the 7th.

After each run we update statistics, although take a sample, rather than a `FULLSCAN`.

Maybe I am being naïve, but I would have expected SQL Server identify the highest key and lowest key in the range to ensure it got an accurate range sample. According to [this article][1]:  

> For the first bucket, the lower boundary is the smallest value of the column on which the histogram is generated.

However, it doesn't mention the last bucket/largest value. 

With the sampled Statistics update on the morning of the 8th, the sample missed the highest value in the table (the 7th).

[![enter image description here][2]][2]

As we do a lot of querying on data from the prior day, this resulted in inaccurate cardinality estimation and a number of queries timing out.

Should SQL Server not identify the highest value for that key and use that as the maximum `RANGE_HI_KEY`? Or is this just one of the limits of update without using `FULLSCAN`?

Version SQL Server 2012 SP2-CU7. We cannot currently upgrade due to a change in `OPENQUERY` behaviour in SP3 that was rounding down numbers in a linked server query between SQL Server and Oracle.

  [1]: https://www.sqlservercentral.com/blogs/practicalsqldba/2013/06/27/sql-server-part-2-all-about-sql-server-statistics-histogram/
  [2]: https://i.stack.imgur.com/hdKC9.png
Top Answer
Paul White (imported from SE)
>Should SQL Server not identify the highest value for that key and use that as the maximum `RANGE_HI_KEY`? Or is this just one of the limits of update without using `FULLSCAN`?

It's a limitation of the current implementation of sampled statistics. As it stands, sampled statistics collection uses `TABLESAMPLE SYSTEM`, which uses an allocation-order scan and chooses pages from the scan to sample. Only chosen pages contribute to the histogram.

Since the scan is allocation-ordered (rather than index-ordered), there is no way to give preference to the first and last pages in key order.

For more information see this related question:


and my article, [Allocation Order Scans][1]

For workarounds, see [Statistics on Ascending Columns][2] by [Fabiano Amorim][3]

  [1]: http://sqlperformance.com/2015/01/t-sql-queries/allocation-order-scans
  [2]: https://www.simple-talk.com/sql/database-administration/statistics-on-ascending-columns/
  [3]: https://www.simple-talk.com/author/fabiano-amorim/

This room is for discussion about this question.

Once logged in you can direct comments to any contributor here.

Enter question or answer id or url (and optionally further answer ids/urls from the same question) from

Separate each id/url with a space. No need to list your own answers; they will be imported automatically.