sql-server
Mikael Eriksson
Doing a `where Col in (select ...)` does not produce a seek plan if all `Col` values are null.

Setting a value for one row and updating stats gives you a seek plan.

There are some possible workarounds, or rather different behaviors. Checking for `not null` seems to take care of it, as does using a join instead of `Col in (`. The behavior also changes if you have a clustered primary key, but looks to be the same as for a heap when the primary key is not the clustered key.

Let's focus on one thing (the heap), at least in the fiddle. Is there some light to be shed on this issue?
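To make the variants concrete, they look roughly like this (table and column names are assumed here; the fiddle has the real schema):

```sql
declare @T table (ID int primary key);
insert into @T (ID) values (1);

-- Original form: no seek plan when all Col values are null
select * from dbo.T where Col in (select ID from @T);

-- Adding an explicit not-null check seems to take care of it
select * from dbo.T where Col is not null and Col in (select ID from @T);

-- A join instead of IN also behaves differently
-- (note: not equivalent to IN if @T could contain duplicates)
select T.* from dbo.T as T inner join @T as V on V.ID = T.Col;
```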




https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=cca05b16f54e6b41491340c8712ee5c3

Top Answer
Paul White
This is an issue I see quite often when the 'new' cardinality estimator is used.

## Preamble

Your `IN` query is parsed by SQL Server as a [`SOME` comparison](https://docs.microsoft.com/en-us/sql/t-sql/language-elements/some-any-transact-sql), pretty much as if you had written:

```
declare @T table(ID int primary key);
insert into @T(ID) values(1);

select T.ID,
       T.C
from dbo.T 
where T.C = some (select ID from @T)
```

~~~
*** Converted Tree: ***
    LogOp_Project QCOL: [dbo].[T].ID QCOL: [dbo].[T].C
        LogOp_Select
            LogOp_Get TBL: dbo.T 
            ScaOp_SomeComp 2
                ScaOp_Identifier QCOL: [dbo].[T].C
                LogOp_Project
                    LogOp_Get TBL: @T Hints( NOLOCK )
                    AncOp_PrjList 
        AncOp_PrjList
~~~

SQL Server cannot work with subqueries in a relational selection directly, so this is transformed to an *apply* (left semi type) with an equality comparison by rule `RemoveSubqInSel`:

~~~
    LogOp_Apply (x_jtLeftSemi)
      LogOp_Get TBL: dbo.T
      LogOp_Select
        LogOp_Get TBL: @T @T Hints( NOLOCK )
        ScaOp_Comp x_cmpEq
          ScaOp_Identifier QCOL: [dbo].[T].C
          ScaOp_Identifier QCOL: .ID
~~~

So the starting point of your query is in fact an apply semi join (as desired). It all goes downhill from there.
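As an aside, the internal trees shown above can be inspected on a test system using undocumented trace flags: 3604 redirects optimizer output to the client, 8605 shows the converted (input) tree, and 8621 shows trees after each rule (such as `RemoveSubqInSel`) is applied. Unsupported, for exploration only:

```sql
declare @T table (ID int primary key);
insert into @T (ID) values (1);

select T.ID,
       T.C
from dbo.T
where T.C in (select ID from @T)
option
(
    querytraceon 3604,  -- redirect trace output to the client
    querytraceon 8605,  -- display the converted tree
    querytraceon 8621   -- display trees after rule application
);
```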

## Cardinality Estimation

The new CE will sometimes use a [*Simple Join*](https://www.sqlshack.com/join-estimation-internals/) calculator to estimate the selectivity of a join. This approach does not use histograms, but simply assumes that all distinct values on one side of the join are found among the distinct values on the other side. It's not the greatest strategy in the world, but it is fast.

### All nulls

In your particular case (with all nulls) the calculation looks like this:

~~~
Plan for computation:

  CSelCalcSimpleJoinWithDistinctCounts (Using base cardinality)
      CDVCPlanLeaf
          0 Multi-Column Stats, 1 Single-Column Stats, 0 Guesses
      CDVCPlanUniqueKey
Loaded histogram for column QCOL: [dbo].[T].C from stats with id 3

Using ambient cardinality 10000 to combine distinct counts:
  1

Combined distinct count: 1

Selectivity: 1

Stats collection generated: 

  CStCollJoin(ID=3, CARD=10000 x_jtLeftSemi)
      CStCollBaseTable(ID=1, CARD=10000 TBL: dbo.T)
      CStCollBlackBox(ID=2, CARD=1)

End selectivity computation
~~~

In short, it sees one distinct value in the statistics for the real table (null), and one row (with an unknown value) in the table variable.

Somewhat optimistically, the *Simple Join* algorithm assumes these distinct values match, hence **100% selectivity**. On that basis, a *Merge Join* has a lower implementation cost than *Nested Loops Apply*.
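The selectivity computation traces in this answer come from undocumented trace flag 2363, which reports the new CE's calculator choices and derived statistics. To reproduce the output on a test system:

```sql
declare @T table (ID int primary key);
insert into @T (ID) values (1);

select T.ID,
       T.C
from dbo.T
where T.C in (select ID from @T)
option (querytraceon 3604, querytraceon 2363);  -- undocumented; test use only
```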

### Nulls and a -1

When you introduce a second distinct value (-1):

~~~
Plan for computation:

  CSelCalcSimpleJoinWithDistinctCounts (Using base cardinality)
      CDVCPlanLeaf
          0 Multi-Column Stats, 1 Single-Column Stats, 0 Guesses
      CDVCPlanUniqueKey
Loaded histogram for column QCOL: [dbo].[T].C from stats with id 3

Using ambient cardinality 10000 to combine distinct counts:
  2

Combined distinct count: 2

Selectivity: 0.5

Stats collection generated: 

  CStCollJoin(ID=3, CARD=5000 x_jtLeftSemi)
      CStCollBaseTable(ID=1, CARD=10000 TBL: dbo.T)
      CStCollBlackBox(ID=2, CARD=1)

End selectivity computation
~~~

There is now one distinct value in the table variable, and two in the real table, so the selectivity estimate is **50%**. This is enough to make the apply option cheaper than merge (though still a poor estimate).

### Workarounds

1. Use the original cardinality estimator, e.g.:

        USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION')

2. Use undocumented model variation trace flag 9473 to allow zero selectivity.

Only the first option would be supported.
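Applied to this query, the supported workaround looks like:

```sql
declare @T table (ID int primary key);
insert into @T (ID) values (1);

select T.ID,
       T.C
from dbo.T
where T.C in (select ID from @T)
option (use hint ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
```

The second (unsupported) alternative would be `option (querytraceon 9473)`.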

---

Related:

@@@ answer 1497
