At the moment I'm trying to figure out how SQL Server evaluates the cardinality of range predicates that partially cover the histogram step.

On the Internet, at [cardinality-estimation-for-and-for-intra-step-statistics-value][1] I came across a similar question and Paul White gave a rather interesting answer to it.

According to Paul's answer, the formulas for estimating cardinality for the predicates >= and > (in this case, I am only interested in the Cardinality estimator model of at least 120) are as follows:

For >:

    Cardinality = EQ_ROWS + (AVG_RANGE_ROWS * (F * (DISTINCT_RANGE_ROWS - 1)))

For >=:

    Cardinality = EQ_ROWS + (AVG_RANGE_ROWS * ((F * (DISTINCT_RANGE_ROWS - 1)) + 1))

I tested the application of these formulas on the *[Production].[TransactionHistory]* table of the *AdventureWorks2014* database based on the range predicate using the *TransactionDate* column and datetime range between '20140614' and '20140618'.

The statistics for the histogram step of this range are as follows:

[![Histogram][2]][2]
 
According to the formula, I calculated the cardinality for the following query:

    SELECT COUNT(1)
    FROM [AdventureWorks2014].[Production].[TransactionHistory]
    WHERE [TransactionDate] BETWEEN '20140615 00:00:00.000' AND '20140616 00:00:00.000'

The calculation was performed using the following code:

      DECLARE @predStart DATETIME =  '20140615 00:00:00.000'
      DECLARE @predEnd DATETIME = '20140616 00:00:00.000'
    
      DECLARE @stepStart DATETIME = '20140614 00:00:00.000'
      DECLARE @stepEnd DATETIME = '20140618 00:00:00.000'
    
      DECLARE @predRange FLOAT = DATEDIFF(ms, @predStart, @predEnd)
      DECLARE @stepRange FLOAT = DATEDIFF(ms, @stepStart, @stepEnd)
    
      DECLARE @F FLOAT = @predRange / @stepRange;
    
      DECLARE @avg_range_rows FLOAT = 100.3333
      DECLARE @distinct_range_rows INT = 3
      DECLARE @EQ_ROWS INT = 0
    
      SELECT @F AS 'F'
      
      --for new cardinality estimator
    
      SELECT @EQ_ROWS + @avg_range_rows * (@F * (@distinct_range_rows - 1) + 1) AS [new_card]

After calculating, I got the following results:

[![enter image description here][3]][3]
 
According to the formula, it turned out 150.5, but the optimizer estimates the predicate at 225.75 rows, and if you change the upper border of the predicate to ‘20140617’, the optimizer will already evaluate 250.833 rows, while using the formula we get only 200.6666 rows.

Please tell me, how does Cardinality Estimator evaluate in this case, maybe I made a mistake somewhere in my understanding of the quoted formulas?


  [1]: https://dba.stackexchange.com/questions/148523/cardinality-estimation-for-and-for-intra-step-statistics-value
  [2]: https://i.stack.imgur.com/sMCyC.png
  [3]: https://i.stack.imgur.com/loX3W.png
Top Answer
Paul White (imported from SE)
SQL Server uses different calculations in different situations. Your example is different from the linked Q & A because your range is entirely contained within a step; it does not cross a step boundary. It is also an interval with two ends rather than one. Writing `BETWEEN` is the same as writing two separate predicates with `>=` and `<=`.

### Interval with two boundaries, within a single step

The formula is modified to perform linear interpolation within the step for the number of distinct values expected, and reflect that two range endpoints are now specified (and assumed to exist within the histogram step) rather than one.

Using the histogram steps given in the question:

[![question histogram steps][3]][3]

For the query with `BETWEEN '20140615' AND '20140616'`, the calculation is:

```
DECLARE
    @Q1 float = CONVERT(float, CONVERT(datetime, '2014-06-15')),
    @Q2 float = CONVERT(float, CONVERT(datetime, '2014-06-16')),
    @K1 float = CONVERT(float, CONVERT(datetime, '2014-06-14')),
    @K2 float = CONVERT(float, CONVERT(datetime, '2014-06-18')),
    @RANGE_ROWS float = 301,
    @DISTINCT_RANGE_ROWS float = 3;

DECLARE
    @S1 float = (@Q1 - @K1) / (@K2 - @K1),
    @S2 float = (@Q2 - @K1) / (@K2 - @K1);

DECLARE
    @F float = @S2 - @S1;

DECLARE
    @AVG_RANGE_ROWS float = @RANGE_ROWS / @DISTINCT_RANGE_ROWS;

SELECT
    @AVG_RANGE_ROWS * ((@F * (@DISTINCT_RANGE_ROWS - 2)) + 2);
```

...giving **225.75**. Changing `@Q2` from `'20140616'` to `'20140617'` gives a result of **250.833**.

Both results match those given in the question.

  [1]: https://i.stack.imgur.com/ComWw.png
  [2]: https://i.stack.imgur.com/PmHiv.png
  [3]: https://i.stack.imgur.com/sMCyC.png

This room is for discussion about this question.

Once logged in you can direct comments to any contributor here.

Enter question or answer id or url (and optionally further answer ids/urls from the same question) from

Separate each id/url with a space. No need to list your own answers; they will be imported automatically.