We have a query on a SQL Server 2016 SP2 CU12 Enterprise where the Query Optimizer estimates that only 1 row would come out of the *Nested Loops Join* operator, in reality 108501 rows came back. This caused that the `Sort` operator spilled to TempDB.

The estimates on the inner (index seek) and outer input (index seek) of the Nested Loops Join are correct.

I added trace flags 2363 (Selectivity Computation) and 3604 (redirect the output to the messages window) and here I found that there was a badly-formed histogram:

>     Plan for computation:
>     
>       CSelCalcExpressionComparedToExpression( QCOL: [Object1].Column1 x_cmpEq QCOL: [Object3].Column18 )
>     
>     Loaded histogram for column QCOL: [Object1].Column1 from stats with id 1	*** WARNING: badly-formed histogram ***
>     
>     Loaded histogram for column QCOL: [Object3].Column18 from stats with id 9
>     
>     Selectivity: 1.07973e-009
>     
>     Stats collection generated: 
>     
>       CStCollJoin(ID=4, CARD=1 x_jtLeftSemi)
>     
>           CStCollBaseTable(ID=1, CARD=5.01133e+007 TBL: Schema1.Table2 AS TBL: AA)
>     
>           CStCollFilter(ID=3, CARD=108210)
>     
>               CStCollBaseTable(ID=2, CARD=2.00511e+006 TBL: Schema1.Table1 AS TBL: A)
>     
>     End selectivity computation

Above is only a part of the output, the full text can be found [here][1]

When I updated the badly-formed histogram with fullscan the estimates are correct (without fullscan this issue is not resolved).

But as soon as one record is inserted in the table, the histogram is badly-formed again.

The query plan (with badly-formed histogram) can be found [here][2] and [here][3] you can find the query plan after update statistics.

No query optimizer fixes are enabled. 
When I enable the original cardinality estimator for this query, using trace flag 9481, I'm getting the same query plan as after update statistics.

What can cause the badly-formed histogram?

Is there a way to solve this issue?

I tried the `PERSIST_SAMPLE_PERCENT` option but it didn't make a difference, the histogram gets also badly-formed.


  [1]: https://docs.google.com/document/d/1YNpzJdMr8XRg39YHWXG6ZrF5Gn1eVL5gssTvVFtVGdU/edit?usp=sharing
  [2]: https://www.brentozar.com/pastetheplan/?id=rkLdWeBXP
  [3]: https://www.brentozar.com/pastetheplan/?id=BkJCTeS7v
Top Answer
Paul White
> What can cause the badly-formed histogram?

This is a bug of course. It occurs when SQL Server scales the inexact histogram and encounters a unique column with apparently more than one value per key. This was [partially addressed][1] for SQL Server 2016, but it seems to me the full fix is only present on SQL Server 2017 (I tested on CU21).

> Is there a way to solve this issue?

Use `FULLSCAN` statistics, the original CE, upgrade to SQL Server 2017, or wait for the full fix to be back-ported.

### Demo

Using my standard Numbers table creation script:

```
DROP TABLE IF EXISTS dbo.Numbers;
GO
-- Itzik Ben-Gan's row generator
WITH
  L0   AS (SELECT 1 AS c UNION ALL SELECT 1),
  L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),
  L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),
  L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),
  L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),
  L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),
  Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS n FROM L5)
SELECT
    -- Destination column type integer NOT NULL
    ISNULL(CONVERT(integer, N.n), 0) AS n
INTO dbo.Numbers
FROM Nums AS N
WHERE N.n >= 1
AND N.n <= 10 * 1000 * 1000
OPTION (MAXDOP 1);
GO
ALTER TABLE dbo.Numbers
ADD CONSTRAINT PK_Numbers_n
PRIMARY KEY CLUSTERED (n)
WITH (SORT_IN_TEMPDB = ON, MAXDOP = 1, FILLFACTOR = 100);
```

Update the statistics to be sampled:

```
UPDATE STATISTICS dbo.Numbers;
```

Trivial query (request an estimated plan only):

```
SELECT * 
FROM dbo.Numbers AS N1
WHERE EXISTS
(
    SELECT * 
    FROM dbo.Numbers AS N2
    WHERE N2.n = N1.n
)
OPTION (QUERYTRACEON 3604, QUERYTRACEON 2363);
```

The trace output includes:

~~~none
Begin selectivity computation

Input tree:

  LogOp_LeftSemiJoin
      CStCollBaseTable(ID=1, CARD=1e+007 TBL: dbo.Numbers AS TBL: N1)
      CStCollBaseTable(ID=2, CARD=1e+007 TBL: dbo.Numbers AS TBL: N2)
      ScaOp_Comp x_cmpEq
          ScaOp_Identifier QCOL: [N2].n
          ScaOp_Identifier QCOL: [N1].n

Plan for computation:
  CSelCalcExpressionComparedToExpression( QCOL: [N1].n x_cmpEq QCOL: [N2].n )

Loaded histogram for column QCOL: [N1].n from stats with id 1
	*** WARNING: badly-formed histogram ***

Selectivity: 1
~~~

Validated on Microsoft SQL Server 2016 (SP2-CU14).

The statistics histogram obtained via:

```
DBCC SHOW_STATISTICS 
    (Numbers, [PK_Numbers_n])
    WITH HISTOGRAM;
```

has entries like:

[![2016 histogram][2]][2]

on SQL Server 2017 the histogram is:

[![2017 histogram][3]][3]

Note the `EQ_ROWS` steps are pre-scaled rather than relying on the CE to do it.

  [1]: https://support.microsoft.com/en-us/help/3202425/improvement-improves-the-query-performance-for-sql-server-2016-by-chan
  [2]: https://i.stack.imgur.com/nBgVs.png
  [3]: https://i.stack.imgur.com/t4w5s.png

This room is for discussion about this question.

Once logged in you can direct comments to any contributor here.

Enter question or answer id or url (and optionally further answer ids/urls from the same question) from

Separate each id/url with a space. No need to list your own answers; they will be imported automatically.