Cardinality Estimate for LIKE operator (Local Variables) - Databases

TopAnswers Databases

Meta

Databases

TeX

Code Golf

APL

C++

.net

db<>fiddle

Java

*nix

PHP

PowerShell

Python

Rust

टेक्-मराठी

Typst

Web Client Dev

Web Server Dev

Cardinality Estimate for LIKE operator (Local Variables)

sql-server sql-server-2014 add tag

fza (imported from SE)

I was under the impression that when using the `LIKE` operator in all optimise for unknown scenarios both the legacy and new CEs use a 9% estimate (assuming that relevant statistics are available and the query optimiser doesn't have to resort to selectivity guesses).

When executing the below query against the credit database I get different estimates under the different CEs.  Under the new CE I receive an estimate of 900 rows which I was expecting, under the legacy CE I receive an estimate of 241.416 and I can't figure out how this estimate is derived.  Is anyone able to shed any light?

    -- New CE (Estimate = 900)
    DECLARE @LastName VARCHAR(15) = 'BA%'
    SELECT * FROM [Credit].[dbo].[member]
    WHERE [lastname] LIKE @LastName;
    
    -- Forcing Legacy CE (Estimate = 241.416)
    DECLARE @LastName VARCHAR(15) = 'BA%'
    SELECT * FROM [Credit].[dbo].[member]
    WHERE [lastname] LIKE @LastName
    OPTION (
    QUERYTRACEON 9481,
    QUERYTRACEON 9292,
    QUERYTRACEON 9204,
    QUERYTRACEON 3604
    );

In my scenario, I already have the credit database set to compatibility level 120, hence why in the second query I'm using trace flags to force the legacy CE and to also provide information on what statistics are used/considered by the query optimiser.  I can see the column statistics on 'lastname' are being used but I still can't work out how the estimate of 241.416 is derived.

I couldn't find anything online other than [this Itzik Ben-Gan article][1], which states "When using the LIKE predicate in all optimize for unknown scenarios both the legacy and new CEs use a 9 percent estimate.". The information in that post would appear to be incorrect.

  [1]: http://sqlmag.com/sql-server/sql-server-query-optimization-no-unknown-unknowns

Top Answer

Paul White (imported from SE)

The guess for `LIKE` *in your case* is based on:

* `G`: The standard 9% guess (`sqllang!x_Selectivity_Like`)
* `M`: A factor of 6 (magic number)
* `D`: Average data length in bytes (from statistics), rounded down to integer

Specifically, `sqllang!CCardUtilSQL7::ProbLikeGuess` uses:

> `Selectivity (S) = G / M * LOG(D)`

Notes:

* The `LOG(D)` term is omitted if `D` is between 1 and 2.
* If `D` is less than 1 (including for missing or `NULL` statistics):  
`D = FLOOR(0.5 * maximum column byte length)`

This sort of quirkiness and complexity is quite typical of the original CE.

In the question example, the average length is 5 (5.6154 from `DBCC SHOW_STATISTICS` rounded down):

```none
Estimate = 10,000 * (0.09 / 6 * LOG(5)) = <b>241.416</b>
```

Other example values:

```none
 D  = Estimate using formula for S
 15 = 406.208
 14 = 395.859
 13 = 384.742
 12 = 372.736
 11 = 359.684
 10 = 345.388
 09 = 329.584
 08 = 311.916
 07 = 291.887
 06 = 268.764
 05 = 241.416
 04 = 207.944
 03 = 164.792
 02 = 150.000 (LOG not used)
 01 = 150.000 (LOG not used)
 00 = 291.887 (LOG 7) /* FLOOR(0.5 * 15) [15 since lastname is varchar(15)] */
```

### Test rig

    DECLARE
        @CharLength integer = 5, -- Set length here
        @Counter integer = 1;
    
    CREATE TABLE #T (c1 varchar(15) NULL);
    
    -- Add 10,000 rows
    SET NOCOUNT ON;
    SET STATISTICS XML OFF;
    
    BEGIN TRANSACTION;
    WHILE @Counter <= 10000
    BEGIN
        INSERT #T (c1) VALUES (REPLICATE('X', @CharLength));
        SET @Counter = @Counter + 1;
    END;
    COMMIT TRANSACTION;
    
    SET NOCOUNT OFF;
    SET STATISTICS XML ON;
    
    -- Test query
    DECLARE @Like varchar(15);
    SELECT * FROM #T AS T 
    WHERE T.c1 LIKE @Like;
    
    DROP TABLE #T;

1 Answer