How does SQL Server's optimizer estimate the number of rows in a joined table? - Databases

TopAnswers Databases

Meta

Databases

TeX

Code Golf

APL

C++

.net

db<>fiddle

Java

*nix

PHP

PowerShell

Python

Rust

टेक्-मराठी

Typst

Web Client Dev

Web Server Dev

How does SQL Server's optimizer estimate the number of rows in a joined table?

sql-server sql-server-2012 add tag

8kb (imported from SE)

I am running this query in the [AdventureWorks2012][1] database: 

    SELECT 
    	s.SalesOrderID,
    	d.CarrierTrackingNumber,
    	d.ProductID,
    	d.OrderQty
    FROM Sales.SalesOrderHeader s 
    JOIN Sales.SalesOrderDetail d 
    	ON s.SalesOrderID = d.SalesOrderID
    WHERE s.CustomerID = 11077

If I look at the estimated execution plan, I see the following: 

![enter image description here][2]

The initial index seek (top right) is using the IX_SalesOrderHeader_CustomerID index and searching on the literal 11077. It has an estimate of 2.6192 rows. 

![enter image description here][3]

If I use `DBCC SHOW_STATISTICS ('Sales.SalesOrderHeader', 'IX_SalesOrderHeader_CustomerID') WITH HISTOGRAM`, it shows that the value 11077 is between the two sampled keys 11019 and 11091. 

![enter image description here][4]

The average number of distinct rows between 11019 and 11091 is 2.619718, or rounded to 2.61972 which is the value of estimated rows shown for the index seek.

The part I don't understand is the estimated number of rows for the clustered index seek against the SalesOrderDetail table. 

![enter image description here][5]

If I run `DBCC SHOW_STATISTICS ('Sales.SalesOrderDetail', 'PK_SalesOrderDetail_SalesOrderID_SalesOrderDetailID')`:

![enter image description here][6]

So the density of the SalesOrderID (which I am joining on) is 3.178134E-05. That means that 1/3.178134E-05 (31465) equals the number of unique SalesOrderID values in the SalesOrderDetail table.

If there are 31465 unique SalesOrderID's in the SalesOrderDetail, then with an even distribution, the average number of rows per SalesOrderID is 121317 (total number of rows) divided by 31465. The average is 3.85561

So if the estimated number of rows to be loop through is 2.61972, and the average to be returned in 3.85561, the I would think the estimated number of rows would be 2.61972 * 3.85561 = 10.10062.

But the estimated number of rows is 11.4867.

I think my understanding of the second estimate is incorrect and the differing numbers seems to indicate that. What am I missing? 

  [1]: https://msftdbprodsamples.codeplex.com/releases/view/93587
  [2]: http://i.stack.imgur.com/sagIZ.jpg
  [3]: http://i.stack.imgur.com/U0YHf.jpg
  [4]: http://i.stack.imgur.com/PlnIE.jpg
  [5]: http://i.stack.imgur.com/BWiRG.jpg
  [6]: http://i.stack.imgur.com/YkX6J.jpg

Top Answer

Paul White

>*I think my understanding of the second estimate is incorrect and the differing numbers seems to indicate that. What am I missing?*

Using the SQL Server 2012 cardinality estimator, the selectivity of the join drives the estimated number of rows on the inner side of the nested loops join, and not the other way around.

The 11.4867 number is *derived* (for display in showplan) by dividing the computed estimated cardinality of the join output (30.0919) by the number of iterations (2.61972). The result, using single-precision floating-point arithmetic, is **11.4867**.

It really is as simple as that. Note that the (logical) join selectivity is independent of the choice of physical join operator. It remains the same whether the join is ultimately performed using a Nested Loops, Hash, or Merge Join physical operator.

In SQL Server 2012 and earlier, the join selectivity (as a whole) is estimated using the `SalesOrderID` histograms from each table (computed for each histogram step, after step boundary alignment using linear interpolation as necessary). The `SalesOrderID` histogram associated with the `SalesOrderHeader` table is also adjusted for the scaling effect of the independent `CustomerID` filter.

That is not to say there is anything fundamentally 'wrong' with the alternate calculation proposed in the question; it just makes a different set of assumptions. There will always be different ways to compute or combine estimates for a given sequence of logical operations. There is no general guarantee that different statistical methods applied to the same data will produce the same answers, or that one method would always be superior to the other. Inconsistencies resulting from the application of different statistical methods can even appear within a single final execution plan, though they are rarely noticed.

As a side-note, the SQL Server 2014 cardinality estimator takes a different approach to combining the independent-filter-adjusted histogram information (["coarse alignment"][1]), which results in a different final estimate of **10.1006** rows for this query:

```none
Plan for computation:

  CSelCalcExpressionComparedToExpression
  (QCOL: [s].SalesOrderID x_cmpEq QCOL: [d].SalesOrderID)
    
Loaded histogram for column QCOL: [s].SalesOrderID from stats with id 1
Loaded histogram for column QCOL: [d].SalesOrderID from stats with id 1
    
Stats collection generated: 
    
  CStCollJoin(ID=4, **CARD=10.1006** x_jtInner)
      CStCollFilter(ID=3, CARD=2.61972)
          CStCollBaseTable(ID=1, CARD=31465 TBL: Sales.SalesOrderHeader AS TBL: s)
      CStCollBaseTable(ID=2, CARD=121317 TBL: Sales.SalesOrderDetail AS TBL: d)
```

This happens to be the same result as the calculation in the question, though the detailed reasoning is different (i.e. it is not based on an assumed nested loops implementation).


  [1]: https://sqlperformance.com/2018/11/sql-optimizer/histogram-coarse-alignment

1 Answer