Index on Persisted Computed column needs key lookup to get columns in the computed expression

sql-server sql-server-2008 add tag

GarethD (imported from SE)

I have a persisted computed column on a table which is simply made up concatenated columns, e.g.

	CREATE TABLE dbo.T 
	(	
		ID INT IDENTITY(1, 1) NOT NULL CONSTRAINT PK_T_ID PRIMARY KEY,
		A VARCHAR(20) NOT NULL,
		B VARCHAR(20) NOT NULL,
		C VARCHAR(20) NOT NULL,
		D DATE NULL,
		E VARCHAR(20) NULL,
		Comp AS A + '-' + B + '-' + C PERSISTED NOT NULL 
	);

In this `Comp` is not unique, and D is the valid from date of each combination of `A, B, C`, therefore I use the following query to get the end date for each `A, B, C` (basically the next start date for the same value of Comp):

	SELECT	t1.ID,
			t1.Comp,
			t1.D,
			D2 = (	SELECT	TOP 1 t2.D
					FROM	dbo.T t2
					WHERE	t2.Comp = t1.Comp
					AND		t2.D > t1.D
				    ORDER BY t2.D
				)
	FROM	dbo.T t1
	WHERE	t1.D IS NOT NULL -- DON'T CARE ABOUT INACTIVE RECORDS
	ORDER BY t1.Comp;

I then added an index to the computed column to assist in this query (and also others):

	CREATE NONCLUSTERED INDEX IX_T_Comp_D ON dbo.T (Comp, D) WHERE D IS NOT NULL;

The query plan however surprised me. I would have thought that since I have a where clause stating that `D IS NOT NULL` and I am sorting by `Comp`, and not referencing any column outside of the index that the index on the computed column could be used to scan t1 and t2, but I saw a clustered index scan. 

![enter image description here][1]

So I forced the use of this index to see if it yielded a better plan:

	SELECT	t1.ID,
			t1.Comp,
			t1.D,
			D2 = (	SELECT	TOP 1 t2.D
					FROM	dbo.T t2
					WHERE	t2.Comp = t1.Comp
					AND		t2.D > t1.D
				    ORDER BY t2.D
				)
	FROM	dbo.T t1 WITH (INDEX (IX_T_Comp_D))
	WHERE	t1.D IS NOT NULL
	ORDER BY t1.Comp;

Which gave this plan

![enter image description here][2]

This shows that a Key lookup is being used, the details of which are:

![enter image description here][3]

Now, according to the SQL-Server documentation:

>You can create an index on a computed column that is defined with a deterministic, but imprecise, expression if the column is marked PERSISTED in the CREATE TABLE or ALTER TABLE statement. This means that the Database Engine stores the computed values in the table, and updates them when any other columns on which the computed column depends are updated. The Database Engine uses these persisted values when it creates an index on the column, and when the index is referenced in a query. This option enables you to create an index on a computed column when Database Engine cannot prove with accuracy whether a function that returns computed column expressions, particularly a CLR function that is created in the .NET Framework, is both deterministic and precise.

So if, as the docs say *"the Database Engine stores the computed values in the table"*, and the value is also being stored in my index, why is a Key Lookup required to get A, B and C when they are not referenced in the query at all? I assume they are being used to calculate Comp, but why? Also, why can the query use the index on `t2`, but not on `t1`?

*[Queries and DDL on SQL Fiddle](http://sqlfiddle.com/#!3/e47a1/2)*


*N.B. I have tagged SQL Server 2008 because this is the version that my main problem is on, but I also get the same behaviour in 2012.*

  [1]: https://i.stack.imgur.com/NrGjj.png
  [2]: https://i.stack.imgur.com/AkBTW.png
  [3]: https://i.stack.imgur.com/ygAJv.png

Top Answer

wBob (imported from SE)

Although this might be a bit of a co-incidence due to the artificial nature of your test data, being as you mentioned SQL 2012 I tried a rewrite:

    SELECT	ID,
    		Comp,
    		D,
    		D2 = LEAD(D) OVER(PARTITION BY COMP ORDER BY D)
    FROM	dbo.T 
    WHERE	D IS NOT NULL
    ORDER BY Comp;

This yielded a nice low-cost plan using your index and with significantly lower reads than the other options (and the same results for your test data).

![Plan Explorer costs for four options: Original; original with hint; outer apply and Lead][1]

I suspect your real data is more complicated so there might be some scenarios where this query behaves semantically different to yours, but it does show sometimes the new features can make a real difference.

I did experiment with some more varied data and found some scenarios to match and some not:

    --Example 1: results matched
    TRUNCATE TABLE dbo.t
    
    -- Generate some more interesting test data
    ;WITH cte AS
    (
    SELECT TOP 1000 ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) rn
    FROM master.sys.columns c1
    	CROSS JOIN master.sys.columns c2
    	CROSS JOIN master.sys.columns c3
    )
    INSERT T (A, B, C, D)
    SELECT	'A' + CAST( a.rn AS VARCHAR(5) ),
    		'B' + CAST( a.rn AS VARCHAR(5) ),
    		'C' + CAST( a.rn AS VARCHAR(5) ),
    		DATEADD(DAY, a.rn + b.rn, '1 Jan 2013')
    FROM cte a
    	CROSS JOIN cte b
    WHERE a.rn % 3 = 0
     AND b.rn % 5 = 0
    ORDER BY 1, 2, 3
    GO
    
    
    -- Original query
    SELECT	t1.ID,
    		t1.Comp,
    		t1.D,
    		D2 = (	SELECT	TOP 1 D
    				FROM	dbo.T t2
    				WHERE	t2.Comp = t1.Comp
    				AND		t2.D > t1.D
                  	ORDER BY D
    			)
    INTO #tmp1
    FROM	dbo.T t1 
    WHERE	t1.D IS NOT NULL
    ORDER BY t1.Comp;
    GO
    
    SELECT	ID,
    		Comp,
    		D,
    		D2 = LEAD(D) OVER(PARTITION BY COMP ORDER BY D)
    INTO #tmp2
    FROM	dbo.T 
    WHERE	D IS NOT NULL
    ORDER BY Comp;
    GO
    
    
    -- Checks ...
    SELECT * FROM #tmp1
    EXCEPT
    SELECT * FROM #tmp2
    
    SELECT * FROM #tmp2
    EXCEPT
    SELECT * FROM #tmp1


    Example 2: results did not match
    TRUNCATE TABLE dbo.t
    
    -- Generate some more interesting test data
    ;WITH cte AS
    (
    SELECT TOP 1000 ROW_NUMBER() OVER ( ORDER BY ( SELECT 1 ) ) rn
    FROM master.sys.columns c1
    	CROSS JOIN master.sys.columns c2
    	CROSS JOIN master.sys.columns c3
    )
    INSERT T (A, B, C, D)
    SELECT	'A' + CAST( a.rn AS VARCHAR(5) ),
    		'B' + CAST( a.rn AS VARCHAR(5) ),
    		'C' + CAST( a.rn AS VARCHAR(5) ),
    		DATEADD(DAY, a.rn, '1 Jan 2013')
    FROM cte a
    
    -- Add some more data
    INSERT dbo.T (A, B, C, D)
    SELECT A, B, C, D 
    FROM dbo.T
    WHERE DAY(D) In ( 3, 7, 9 )
    
    
    INSERT dbo.T (A, B, C, D)
    SELECT A, B, C, DATEADD( day, 1, D )
    FROM dbo.T
    WHERE DAY(D) In ( 12, 13, 17 )
    
    
    SELECT * FROM #tmp1
    EXCEPT
    SELECT * FROM #tmp2
    
    SELECT * FROM #tmp2
    EXCEPT
    SELECT * FROM #tmp1
    
    SELECT * FROM #tmp2
    INTERSECT
    SELECT * FROM #tmp1
    
    
    select * from #tmp1
    where comp = 'A2-B2-C2'
    
    select * from #tmp2
    where comp = 'A2-B2-C2'


  [1]: https://i.stack.imgur.com/7tpdL.png

Answer #2

Paul White (imported from SE)

>*Why is a Key Lookup required to get A, B and C when they are not referenced in the query at all? I assume they are being used to calculate Comp, but why?*

Columns `A, B, and C` *are* referenced in the query plan - they are used by the seek on `T2`.

>*Also, why can the query use the index on t2, but not on t1?*

The optimizer decided that scanning the clustered index was cheaper than scanning the filtered nonclustered index and then performing a lookup to retrieve the values for columns A, B, and C.

### Explanation

The real question is why the optimizer felt the need to retrieve A, B, and C for the index seek at all. We would expect it to read the `Comp` column using a nonclustered index scan, and then perform a seek on the same index (alias T2) to locate the Top 1 record.

The query optimizer expands computed column references before optimization begins, to give it a chance to assess the costs of various query plans. For some queries, expanding the definition of a computed column allows the optimizer to find more efficient plans.

When the optimizer encounters a correlated subquery, it attempts to 'unroll it' to a form it finds easier to reason about. If it cannot find a more effective simplification, it resorts to rewriting the correlated subquery as an apply (a correlated join):

![Apply rewrite][1]

It just so happens that this apply unrolling puts the logical query tree into a form that does not work well with *project normalization* (a later stage that looks to match general expressions to computed columns, among other things).

In your case, the way the query is written interacts with internal details of the optimizer such that the expanded expression definition is not matched back to the computed column, and you end up with a seek that references columns `A, B, and C` instead of the computed column, `Comp`. This is the root cause.

### Workaround

One idea to workaround this side-effect is to write the query as an apply manually:

    SELECT
        T1.ID,
        T1.Comp,
        T1.D,
        CA.D2
    FROM dbo.T AS T1
    CROSS APPLY
    (  
        SELECT TOP (1)
            D2 = T2.D
        FROM dbo.T AS T2
        WHERE
            T2.Comp = T1.Comp
            AND T2.D > T1.D
        ORDER BY
            T2.D ASC
    ) AS CA
    WHERE
        T1.D IS NOT NULL -- DON'T CARE ABOUT INACTIVE RECORDS
    ORDER BY
        T1.Comp;

Unfortunately, this query will not use the filtered index as we would hope either. The inequality test on column `D` inside the apply rejects `NULLs`, so the apparently redundant predicate `WHERE T1.D IS NOT NULL` is optimized away.

Without that explicit predicate, the filtered index matching logic decides it cannot use the filtered index. There are a number of ways to work around this second side-effect, but the easiest is probably to change the cross apply to an outer apply (mirroring the logic of the rewrite the optimizer performed earlier on the correlated subquery):

    SELECT
        T1.ID,
        T1.Comp,
        T1.D,
        CA.D2
    FROM dbo.T AS T1
    OUTER APPLY
    (  
        SELECT TOP (1)
            D2 = T2.D
        FROM dbo.T AS T2
        WHERE
            T2.Comp = T1.Comp
            AND T2.D > T1.D
        ORDER BY
            T2.D ASC
    ) AS CA
    WHERE
        T1.D IS NOT NULL -- DON'T CARE ABOUT INACTIVE RECORDS
    ORDER BY
        T1.Comp;

Now the optimizer does not need to use the apply rewrite itself (so the computed column matching works as expected) and the predicate is not optimized away either, so the filtered index can be used for both data access operations, and the seek uses the `Comp` column on both sides:

![Outer Apply Plan][2]

This would generally be preferred over adding A, B, and C as `INCLUDEd` columns in the filtered index, because it addresses the root cause of the problem, and does not require widening the index unnecessarily.

### Persisted computed columns

As a side note, it is not necessary to mark the computed column as `PERSISTED`, if you don't mind repeating its definition in a `CHECK` constraint:

    CREATE TABLE dbo.T 
    (	
    	ID integer IDENTITY(1, 1) NOT NULL,
    	A varchar(20) NOT NULL,
    	B varchar(20) NOT NULL,
    	C varchar(20) NOT NULL,
    	D date NULL,
    	E varchar(20) NULL,
    	Comp AS A + '-' + B + '-' + C,
    
        CONSTRAINT CK_T_Comp_NotNull
            CHECK (A + '-' + B + '-' + C IS NOT NULL),
    
        CONSTRAINT PK_T_ID 
            PRIMARY KEY (ID)
    );

    CREATE NONCLUSTERED INDEX IX_T_Comp_D
    ON dbo.T (Comp, D) 
    WHERE D IS NOT NULL;

The computed column is only required to be `PERSISTED` in this case if you want to use a `NOT NULL` constraint or to reference the `Comp` column directly (instead of repeating its definition) in a `CHECK` constraint.


  [1]: https://i.stack.imgur.com/BPYGL.png
  [2]: https://i.stack.imgur.com/J3aud.png

2 Answers