Why does a subquery reduce the row estimate to 1?

sql-server sql-server-2016 add tag

Joe Obbish (imported from SE)

Consider the following contrived but simple query:

    SELECT 
      ID
    , CASE
    	WHEN ID <> 0 
    	THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
    	ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2) 
      END AS ID2
    FROM X_HEAP;

I would expect the final row estimate for this query to be equal to the number of rows in the `X_HEAP` table. Whatever I'm doing in the subquery shouldn't matter for the row estimate because it cannot filter out any rows. However, on SQL Server 2016 I see the row estimate reduced to 1 because of the subquery:

[![bad query][1]][1]

Why does this happen? What can I do about it?


It's very easy to reproduce this issue with the right syntax. Here is one set of table definitions that will do it:

    CREATE TABLE dbo.X_HEAP (ID INT NOT NULL)
    CREATE TABLE dbo.X_OTHER_TABLE (ID INT NOT NULL);
    CREATE TABLE dbo.X_OTHER_TABLE_2 (ID INT NOT NULL);
    
    INSERT INTO dbo.X_HEAP WITH (TABLOCK)
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM master..spt_values;
    
    CREATE STATISTICS X_HEAP__ID ON X_HEAP (ID) WITH FULLSCAN;


db fiddle [link][2].


  [1]: https://i.stack.imgur.com/IU6jI.png
  [2]: http://dbfiddle.uk/?rdbms=sqlserver_2016&fiddle=7c1a04d76f8fec057545f72125f0ac90

Top Answer

Paul White (imported from SE)

## Summary

This cardinality estimation (CE) issue surfaces when:

1. The join is an **outer** join with a **pass-through** predicate
2. The **selectivity** of the pass-through predicate is estimated to be **exactly 1**.

Note: The particular calculator used to determine the selectivity is not important.

---

## Details

The CE computes the selectivity of the outer join as the **sum** of:

* The **inner join** selectivity with the same predicate
* The **anti join** selectivity with the same predicate

The only difference between an outer and inner join is that an outer join also returns rows that do not match on the join predicate. The anti join provides exactly this difference. Cardinality estimation for inner and anti join is easier than for outer join directly.

The join selectivity estimation process is very straightforward:

* First, the selectivity **S~PT~** of the pass-through predicate is assessed.  
 - This is done using whichever calculator is appropriate to the circumstances.  
 - The predicate is the whole thing, including any negating `IsFalseOrNull` component.
* Inner join selectivity := **1 - S~PT~**
* Anti join selectivity := **S~PT~**

The anti join represents rows that will 'pass through' the join. The inner join represents rows that will not 'pass through'. Note that 'pass through' means rows that flow through the join without running the inner side at all. To emphasise: all rows will be returned by the join, the distinction is between rows that run the inner side of the join before emerging, and those that do not.

Clearly, adding **1 - S~PT~** to **S~PT~** should always give a total selectivity of 1, meaning all rows are returned by the join, as expected.

The above calculation works exactly as described for all values of S~PT~ **except 1**.

When **S~PT~ = 1**, both inner join and anti join selectivities are **estimated to be zero**, resulting in a cardinality estimate (for the join as a whole) of one row. As far as I can tell, this is unintentional, and should be reported as a bug.

---

### A related issue

This bug is more likely to manifest than one might think, due to a separate CE limitation. This arises when the `CASE` expression uses an `EXISTS` clause (as is common). For example the following modified query from the question does **not** encounter the unexpected cardinality estimate:

    -- This is fine
    SELECT 
        CASE
            WHEN XH.ID = 1
            THEN (SELECT TOP (1) XOT.ID FROM dbo.X_OTHER_TABLE AS XOT) 
        END
    FROM dbo.X_HEAP AS XH;

Introducing a trivial `EXISTS` does cause the issue to surface:

    -- This is not fine
    SELECT 
        CASE
            WHEN EXISTS (SELECT 1 WHERE XH.ID = 1)
            THEN (SELECT TOP (1) XOT.ID FROM dbo.X_OTHER_TABLE AS XOT) 
        END
    FROM dbo.X_HEAP AS XH;

Using `EXISTS` introduces a semi join (highlighted) to the execution plan:

[![Semi join plan][1]][1]

The estimate for the semi join is fine. The problem is that the CE treats the associated probe column as a simple projection, with a **fixed selectivity of 1**:

```none
Semijoin with probe column treated as a Project.

Selectivity of probe column = 1
```

This automatically meets one of the conditions required for this CE issue to manifest, regardless of the contents of the `EXISTS` clause.

---

For important background information, see [Subqueries in `CASE` Expressions][2] by Craig Freedman.


  [1]: https://i.stack.imgur.com/0VvC3.png
  [2]: https://blogs.msdn.microsoft.com/craigfr/2006/08/23/subqueries-in-case-expressions/

Answer #2

Joe Obbish (imported from SE)

This definitely seems like unintended behavior. It is true that cardinality estimates do not need to be consistent at each step of a plan but this is a relatively simple query plan and the final cardinality estimate is inconsistent with what the query is doing. Such a low cardinality estimate could result in poor choices for join types and access methods for other tables downstream in a more complicated plan.

Through trial and error we can come up with a few similar queries for which the issue does not appear:

    SELECT 
      ID
    , CASE
    	WHEN ID <> 0 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    	ELSE (SELECT -1) 
      END AS ID2
    FROM dbo.X_HEAP;
    
    SELECT 
      ID
    , CASE
    	WHEN ID < 500 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    	WHEN ID >= 500 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
      END AS ID2
    FROM dbo.X_HEAP;
    
We can also come up with more queries for which the issue appears:
    
    SELECT 
      ID
    , CASE
    	WHEN ID < 500 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    	WHEN ID >= 500 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    	ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
      END AS ID2
    FROM dbo.X_HEAP;
    
    SELECT 
      ID
    , CASE
    	WHEN ID = 0 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    	ELSE (SELECT -1) 
      END AS ID2
    FROM dbo.X_HEAP;
    
    SELECT 
      ID
    , CASE
    	WHEN ID = 0 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    	ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
      END AS ID2
    FROM dbo.X_HEAP;

There appears to be a pattern: if there is an expression within the `CASE` that is not expected to be executed and the result expression is a subquery against a table then the row estimate falls to 1 after that expression.

If I write the query against a table with a clustered index the rules change somewhat. We can use the same data:

    CREATE TABLE dbo.X_CI (ID INT NOT NULL, PRIMARY KEY (ID))
    
    INSERT INTO dbo.X_CI WITH (TABLOCK)
    SELECT * FROM dbo.X_HEAP;
    
    UPDATE STATISTICS X_CI WITH FULLSCAN;

This query has a 1000 row final estimate:

    SELECT 
      ID
    , CASE
    	WHEN ID = 0 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    	ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
      END
    FROM dbo.X_CI;

But this query has a 1 row final estimate:

    SELECT 
      ID
    , CASE
    	WHEN ID <> 0 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    	ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
      END
    FROM dbo.X_CI;


To dig into this further we can use the undocumented [trace flag 2363][1] to get information about how the query optimizer performed selectivity calculations. I found it helpful to pair that trace flag with the undocumented [trace flag 8606][2]. TF 2363 seems to give selectivity computations for both the simplified tree and the tree after project normalization. Having both trace flags enabled makes it clear which calculations apply to which tree.

Let's try it for the original query posted in the question:

    SELECT 
      ID
    , CASE
    	WHEN ID <> 0 
    	THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
    	ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2) 
      END AS ID2
    FROM X_HEAP
    OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

Here is part of the part of output which I think is relevant along with some comments:

    Plan for computation:
    
      CSelCalcColumnInInterval -- this is the type of calculator used
    
          Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID -- this is the column used for the calculation
    
    Pass-through selectivity: 0 -- all rows are expected to have a true value for the case expression
    
    Stats collection generated: 
    
      CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter) -- the row estimate after the join will still be 1000
    
          CStCollBaseTable(ID=1, CARD=1000 TBL: X_HEAP)
    
          CStCollBaseTable(ID=2, CARD=1 TBL: X_OTHER_TABLE)
    
    ...
    
    Plan for computation:
    
      CSelCalcColumnInInterval
    
          Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID
    
    Pass-through selectivity: 1 -- no rows are expected to have a true value for the case expression
    
    Stats collection generated: 
    
      CStCollOuterJoin(ID=9, CARD=1 x_jtLeftOuter) -- the row estimate after the join will still be 1
    
          CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter) -- here is the row estimate after the previous join
    
              CStCollBaseTable(ID=1, CARD=1000 TBL: X_HEAP)
    
              CStCollBaseTable(ID=2, CARD=1 TBL: X_OTHER_TABLE)
    
          CStCollBaseTable(ID=3, CARD=1 TBL: X_OTHER_TABLE_2)

Now let's try it for a similar query that doesn't have the issue. I'm going to use this one:

    SELECT 
      ID
    , CASE
    	WHEN ID <> 0 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    	ELSE (SELECT -1) 
      END AS ID2
    FROM dbo.X_HEAP
    OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);


Debug output at the very end:

    Plan for computation:
    
      CSelCalcColumnInInterval
    
          Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID
    
    Pass-through selectivity: 1
    
    Stats collection generated: 
    
      CStCollOuterJoin(ID=9, CARD=1000 x_jtLeftOuter)
    
          CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)
    
              CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_HEAP)
    
              CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)
    
          CStCollConstTable(ID=4, CARD=1) -- this is different than before because we select a constant instead of from a table


Let's try another query for which the bad row estimate is present:

    SELECT 
      ID
    , CASE
    	WHEN ID < 500 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    	WHEN ID >= 500 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    	ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
      END AS ID2
    FROM dbo.X_HEAP
    OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

At the very end the cardinality estimate drops to 1 row, again after Pass-through selectivity = 1. The cardinality estimate is preserved after a selectivity of 0.501 and 0.499.


    Plan for computation:
    
     CSelCalcColumnInInterval
    
          Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID
    
    Pass-through selectivity: 0.501
    
    ...
    
    Plan for computation:
    
      CSelCalcColumnInInterval
    
          Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID
    
    Pass-through selectivity: 0.499
    
    ...
    
    Plan for computation:
    
      CSelCalcColumnInInterval
    
          Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID
    
    Pass-through selectivity: 1
    
    Stats collection generated: 
    
      CStCollOuterJoin(ID=12, CARD=1 x_jtLeftOuter) -- this is associated with the ELSE expression
    
          CStCollOuterJoin(ID=11, CARD=1000 x_jtLeftOuter)
    
              CStCollOuterJoin(ID=10, CARD=1000 x_jtLeftOuter)
    
                  CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_HEAP)
    
                  CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)
    
              CStCollBaseTable(ID=3, CARD=1 TBL: dbo.X_OTHER_TABLE_2)
    
          CStCollBaseTable(ID=4, CARD=1 TBL: X_OTHER_TABLE)


Let's again switch to another similiar query that does not have the issue. I'm going to use this one:

    SELECT 
      ID
    , CASE
    	WHEN ID < 500 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    	WHEN ID >= 500 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
      END AS ID2
    FROM dbo.X_HEAP
    OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

In the debug output there is never a step which has a pass-through selectivity of 1. The cardinality estimate stays at 1000 rows.

    Plan for computation:
    
      CSelCalcColumnInInterval
    
          Column: QCOL: [SE_DB].[dbo].[X_HEAP].ID
    
    Pass-through selectivity: 0.499
    
    Stats collection generated: 
    
      CStCollOuterJoin(ID=9, CARD=1000 x_jtLeftOuter)
    
          CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)
    
              CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_HEAP)
    
              CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)
    
          CStCollBaseTable(ID=3, CARD=1 TBL: dbo.X_OTHER_TABLE_2)
    
    End selectivity computation


What about the query when it involves a table with a clustered index? Consider the following query with the row estimate issue:

    SELECT 
      ID
    , CASE
    	WHEN ID <> 0 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
    	ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
      END
    FROM dbo.X_CI
    OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

The end of the debug output is similar to what we already have seen:


    Plan for computation:
    
      CSelCalcColumnInInterval
    
          Column: QCOL: [SE_DB].[dbo].[X_CI].ID
    
    Pass-through selectivity: 1
    
    Stats collection generated: 
    
      CStCollOuterJoin(ID=9, CARD=1 x_jtLeftOuter)
    
          CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)
    
              CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_CI)
    
              CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE)
    
          CStCollBaseTable(ID=3, CARD=1 TBL: dbo.X_OTHER_TABLE_2)

However, the query against the CI without the issue has different output. Using this query:

    SELECT 
      ID
    , CASE
    	WHEN ID = 0 
    	THEN (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE_2) 
    	ELSE (SELECT TOP 1 ID FROM dbo.X_OTHER_TABLE) 
      END
    FROM dbo.X_CI
    OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363, QUERYTRACEON 8606);

Results in different calculators being used. `CSelCalcColumnInInterval` no longer appears:


    Plan for computation:
    
      CSelCalcFixedFilter (0.559)
    
    Pass-through selectivity: 0.559
    
    Stats collection generated: 
    
      CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)
    
          CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_CI)
    
          CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE_2)
    
    ...
    
    Plan for computation:
    
      CSelCalcUniqueKeyFilter
    
    Pass-through selectivity: 0.001
    
    Stats collection generated: 
    
      CStCollOuterJoin(ID=9, CARD=1000 x_jtLeftOuter)
    
          CStCollOuterJoin(ID=8, CARD=1000 x_jtLeftOuter)
    
              CStCollBaseTable(ID=1, CARD=1000 TBL: dbo.X_CI)
    
              CStCollBaseTable(ID=2, CARD=1 TBL: dbo.X_OTHER_TABLE_2)
    
          CStCollBaseTable(ID=3, CARD=1 TBL: dbo.X_OTHER_TABLE)


In conclusion, we appear to get a bad row estimate after the subquery under the following conditions:

 1. The `CSelCalcColumnInInterval` selectivity calculator is used. I don't know exactly when this is used but it seems to show up much more often when the base table is a heap.

 2. Pass-through selectivity = 1. In other words, one of the `CASE` expressions is expected to evaluated to false for all of the rows. It does not matter if the first `CASE` expression evaluates to true for all rows.

 3. There is an outer join to `CStCollBaseTable`. In other words, the `CASE` result expression is a subquery against a table. A constant value will not work.

Perhaps under those conditions the query optimizer is unintentionally applying the pass-through selectivity to the row estimate of the outer table instead of to the work done on the inner part of the nested loop. That would reduce the row estimate to 1.

I was able to find two workarounds. I was not able to reproduce the issue when using `APPLY` instead of a subquery. The output of trace flag 2363 was very different with `APPLY`. Here's one way to rewrite the original query in the question:

    SELECT 
      h.ID
    , a.ID2
    FROM X_HEAP h
    OUTER APPLY
    (
    SELECT CASE
    	WHEN ID <> 0 
    	THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
    	ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2) 
      END
    ) a(ID2);

[![good query 1][3]][3]

The legacy CE appears to avoid the issue as well.

    SELECT 
      ID
    , CASE
    	WHEN ID <> 0 
    	THEN (SELECT TOP 1 ID FROM X_OTHER_TABLE) 
    	ELSE (SELECT TOP 1 ID FROM X_OTHER_TABLE_2) 
      END AS ID2
    FROM X_HEAP
    OPTION (USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

[![good query 2][4]][4]

A [connect item][5] was submitted for this issue (with some of the details that Paul White provided in his answer).

  [1]: https://sqlperformance.com/2014/01/sql-plan/cardinality-estimation-for-multiple-predicates
  [2]: http://sqlblog.com/blogs/paul_white/archive/2012/05/01/query-optimizer-deep-dive-part-4.aspx
  [3]: https://i.stack.imgur.com/DfAN1.png
  [4]: https://i.stack.imgur.com/5eGyF.png
  [5]: https://connect.microsoft.com/SQLServer/feedback/details/3133394/subquery-in-select-reduces-row-estimate-to-1

2 Answers