We have a query on a SQL Server 2016 SP2 CU12 Enterprise where the Query Optimizer estimates that only 1 row would come out of the *Nested Loops Join* operator, in reality 108501 rows came back. This caused that the `Sort` operator spilled to TempDB.
The estimates on the inner (index seek) and outer input (index seek) of the Nested Loops Join are correct.
I added trace flags 2363 (Selectivity Computation) and 3604 (redirect the output to the messages window) and here I found that there was a badly-formed histogram:
> Plan for computation:
>
> CSelCalcExpressionComparedToExpression( QCOL: [Object1].Column1 x_cmpEq QCOL: [Object3].Column18 )
>
> Loaded histogram for column QCOL: [Object1].Column1 from stats with id 1 *** WARNING: badly-formed histogram ***
>
> Loaded histogram for column QCOL: [Object3].Column18 from stats with id 9
>
> Selectivity: 1.07973e-009
>
> Stats collection generated:
>
> CStCollJoin(ID=4, CARD=1 x_jtLeftSemi)
>
> CStCollBaseTable(ID=1, CARD=5.01133e+007 TBL: Schema1.Table2 AS TBL: AA)
>
> CStCollFilter(ID=3, CARD=108210)
>
> CStCollBaseTable(ID=2, CARD=2.00511e+006 TBL: Schema1.Table1 AS TBL: A)
>
> End selectivity computation
Above is only a part of the output, the full text can be found [here][1]
When I updated the badly-formed histogram with fullscan the estimates are correct (without fullscan this issue is not resolved).
But as soon as one record is inserted in the table, the histogram is badly-formed again.
The query plan (with badly-formed histogram) can be found [here][2] and [here][3] you can find the query plan after update statistics.
No query optimizer fixes are enabled.
When I enable the original cardinality estimator for this query, using trace flag 9481, I'm getting the same query plan as after update statistics.
What can cause the badly-formed histogram?
Is there a way to solve this issue?
I tried the `PERSIST_SAMPLE_PERCENT` option but it didn't make a difference, the histogram gets also badly-formed.
[1]: https://docs.google.com/document/d/1YNpzJdMr8XRg39YHWXG6ZrF5Gn1eVL5gssTvVFtVGdU/edit?usp=sharing
[2]: https://www.brentozar.com/pastetheplan/?id=rkLdWeBXP
[3]: https://www.brentozar.com/pastetheplan/?id=BkJCTeS7v
Top Answer
Paul White
> What can cause the badly-formed histogram?
This is a bug of course. It occurs when SQL Server scales the inexact histogram and encounters a unique column with apparently more than one value per key. This was [partially addressed][1] for SQL Server 2016, but it seems to me the full fix is only present on SQL Server 2017 (I tested on CU21).
> Is there a way to solve this issue?
Use `FULLSCAN` statistics, the original CE, upgrade to SQL Server 2017, or wait for the full fix to be back-ported.
### Demo
Using my standard Numbers table creation script:
```
DROP TABLE IF EXISTS dbo.Numbers;
GO
-- Itzik Ben-Gan's row generator
WITH
L0 AS (SELECT 1 AS c UNION ALL SELECT 1),
L1 AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
L2 AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
L3 AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
L4 AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
L5 AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT
-- Destination column type integer NOT NULL
ISNULL(CONVERT(integer, N.n), 0) AS n
INTO dbo.Numbers
FROM Nums AS N
WHERE N.n >= 1
AND N.n <= 10 * 1000 * 1000
OPTION (MAXDOP 1);
GO
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers_n
PRIMARY KEY CLUSTERED (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1, FILLFACTOR = 100);
```
Update the statistics to be sampled:
```
UPDATE STATISTICS dbo.Numbers;
```
Trivial query (request an estimated plan only):
```
SELECT *
FROM dbo.Numbers AS N1
WHERE EXISTS
(
SELECT *
FROM dbo.Numbers AS N2
WHERE N2.n = N1.n
)
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363);
```
The trace output includes:
~~~none
Begin selectivity computation
Input tree:
LogOp_LeftSemiJoin
CStCollBaseTable(ID=1, CARD=1e+007 TBL: dbo.Numbers AS TBL: N1)
CStCollBaseTable(ID=2, CARD=1e+007 TBL: dbo.Numbers AS TBL: N2)
ScaOp_Comp x_cmpEq
ScaOp_Identifier QCOL: [N2].n
ScaOp_Identifier QCOL: [N1].n
Plan for computation:
CSelCalcExpressionComparedToExpression( QCOL: [N1].n x_cmpEq QCOL: [N2].n )
Loaded histogram for column QCOL: [N1].n from stats with id 1
*** WARNING: badly-formed histogram ***
Selectivity: 1
~~~
Validated on Microsoft SQL Server 2016 (SP2-CU14).
The statistics histogram obtained via:
```
DBCC SHOW_STATISTICS
(Numbers, [PK_Numbers_n])
WITH HISTOGRAM;
```
has entries like:
[![2016 histogram][2]][2]
on SQL Server 2017 the histogram is:
[![2017 histogram][3]][3]
Note the `EQ_ROWS` steps are pre-scaled rather than relying on the CE to do it.
[1]: https://support.microsoft.com/en-us/help/3202425/improvement-improves-the-query-performance-for-sql-server-2016-by-chan
[2]: https://i.stack.imgur.com/nBgVs.png
[3]: https://i.stack.imgur.com/t4w5s.png