sql-server
meme (imported from SE)
I have what is, to me, an interesting question on SARGability. In this case, it's about using a predicate on the difference between two date columns. Here's the setup:

    USE [tempdb]
    SET NOCOUNT ON	
    
    IF OBJECT_ID('tempdb..#sargme') IS NOT NULL
    BEGIN
    DROP TABLE #sargme
    END

    SELECT TOP 1000
    IDENTITY (BIGINT, 1,1) AS ID,
    CAST(DATEADD(DAY, [m].[severity] * -1, GETDATE()) AS DATE) AS [DateCol1],
    CAST(DATEADD(DAY, [m].[severity], GETDATE()) AS DATE) AS [DateCol2]
    INTO #sargme
    FROM sys.[messages] AS [m]
    
    ALTER TABLE [#sargme] ADD CONSTRAINT [pk_whatever] PRIMARY KEY CLUSTERED ([ID])
    CREATE NONCLUSTERED INDEX [ix_dates] ON [#sargme] ([DateCol1], [DateCol2])

What I'll see pretty frequently is something like this:

    /*definitely not sargable*/
    SELECT
        * ,
        DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2])
    FROM
        [#sargme] AS [s]
    WHERE
        DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2]) >= 48;

...which definitely isn't SARGable. It results in an index scan, reads all 1000 rows, no good. Estimated rows stink. You'd never put this in production.

[![No sir, I didn't like it.][1]][1]

It would be nice if we could materialize CTEs, because that would help us make this, well, more SARGable-er, technically speaking. But no, we get the same execution plan as up top.

    /*would be nice if it were sargable*/
    WITH    [x] AS ( SELECT
                    * ,
                    DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2]) AS [ddif]
                   FROM
                    [#sargme] AS [s])
         SELECT
            *
         FROM
            [x]
         WHERE
            [x].[ddif] >= 48;

And of course, since we're still comparing one column against an expression of another column rather than against a constant, this rewrite changes nothing, and is not even half SARGable. No fun. Same execution plan.

    /*not even half sargable*/
    SELECT
        * ,
        DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2])
    FROM
        [#sargme] AS [s]
    WHERE
        [s].[DateCol2] >= DATEADD(DAY, 48, [s].[DateCol1])

If you're feeling lucky, and you're obeying all the ANSI SET options in your connection strings, you could add a computed column, and search on it...

    ALTER TABLE [#sargme] ADD [ddiff] AS 
    DATEDIFF(DAY, DateCol1, DateCol2) PERSISTED

    CREATE NONCLUSTERED INDEX [ix_dates2] ON [#sargme] ([ddiff], [DateCol1], [DateCol2])
    
    SELECT [s].[ID] ,
           [s].[DateCol1] ,
           [s].[DateCol2]
    FROM [#sargme] AS [s]
    WHERE [ddiff] >= 48
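
As an aside, "obeying all the ANSI SET options" here means the documented session settings required to create and use indexes on computed columns: six settings must be ON and one OFF.

```sql
-- Required session settings for indexes on computed columns
SET ANSI_NULLS ON;
SET ANSI_PADDING ON;
SET ANSI_WARNINGS ON;
SET ARITHABORT ON;
SET CONCAT_NULL_YIELDS_NULL ON;
SET QUOTED_IDENTIFIER ON;
SET NUMERIC_ROUNDABORT OFF;
```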

This will get you an index seek for three of the four queries. The odd man out is the one where we add 48 days to DateCol1. The query with `DATEDIFF` in the `WHERE` clause, the `CTE`, and the final query with a predicate on the computed column all give you a much nicer plan with much nicer estimates, and all that.

[![I could live with this.][2]][2]

Which brings me to the question: in a single query, is there a SARGable way to perform this search? 

No temp tables, no table variables, no altering the table structure, and no views. 

I'm fine with self-joins, CTEs, subqueries, or multiple passes over the data. Can work with any version of SQL Server.

Avoiding the computed column is an artificial limitation because I'm more interested in a query solution than anything else.

  [1]: https://i.stack.imgur.com/vvmAp.jpg
  [2]: https://i.stack.imgur.com/noBNR.jpg
Top Answer
Paul White (imported from SE)
I know this is not the answer you want, but an **indexed computed column** is usually the right solution for this type of problem.

It:

* makes the predicate an indexable expression
* allows automatic statistics to be created for better cardinality estimation
* does not *need* to take any space in the base table

To be clear on that last point, the computed column is **not required to be persisted** in this case:

    -- Note: not PERSISTED, metadata change only
    ALTER TABLE #sargme
    ADD DayDiff AS DATEDIFF(DAY, DateCol1, DateCol2);
    
    -- Index the expression
    CREATE NONCLUSTERED INDEX index_name
    ON #sargme (DayDiff)
    INCLUDE (DateCol1, DateCol2);

Now the query:

    SELECT
        S.ID,
        S.DateCol1,
        S.DateCol2,
        DATEDIFF(DAY, S.DateCol1, S.DateCol2)
    FROM
        #sargme AS S
    WHERE
        DATEDIFF(DAY, S.DateCol1, S.DateCol2) >= 48;

...gives the following *trivial* plan:

[![Execution plan][1]][1]

If you have connections using the wrong set options, you could create a regular column and maintain the computed value using triggers.
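
A minimal sketch of that trigger-maintained column, assuming the permanent `dbo.sargme` table from the question's follow-up (the column and trigger names here are hypothetical, and a permanent table is required because DML triggers can't be created on temp tables):

```sql
-- Hypothetical: maintain DayDiff as a regular column via trigger,
-- for environments whose SET options prevent indexing a computed column
ALTER TABLE dbo.sargme ADD DayDiff integer NULL;
GO
CREATE TRIGGER dbo.trg_sargme_DayDiff
ON dbo.sargme
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Only recompute when a date column could have changed
    IF UPDATE(DateCol1) OR UPDATE(DateCol2)
    BEGIN
        UPDATE S
        SET DayDiff = DATEDIFF(DAY, S.DateCol1, S.DateCol2)
        FROM dbo.sargme AS S
        JOIN inserted AS I
            ON I.ID = S.ID;
    END;
END;
```

The maintained column can then be indexed and searched exactly like the computed column above.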

All this only really matters (code challenge aside) if there's a real problem to solve, of course, as Aaron says in [his answer][2].

This is fun to think about, but I don't know any way to achieve what you want reasonably given the constraints in the question. It seems like any optimal solution would require a new data structure of some type; the closest we have being the 'function index' approximation provided by an index on a non-persisted computed column as above.

  [1]: https://i.stack.imgur.com/z2rs7.png
  [2]: https://topanswers.xyz/databases?q=267#a220
Answer #2
aaron bertrand (imported from SE)
I tried a bunch of wacky variations, but didn't find any version better than one of yours. The main problem is how date1 and date2 sort together in your index: the first column forms a nice shelved line, while the gap between the two columns is very jagged. You want this to look more like a funnel than it really will:

    Date1    Date2
    -----    -------
    *             *
    *             *
    *              *
     *       * 
     *        *
     *         *
      *      *
      *           *

There's not really any way I can think of to make that seekable for a certain delta (or range of deltas) between the two points. And I mean a single seek that's executed once + a range scan, not a seek that's executed for every row. That will involve a scan and/or a sort at some point, and these are things you want to avoid obviously. It's too bad you can't use expressions like `DATEADD`/`DATEDIFF` in filtered indexes, or perform any possible schema modifications that would allow a sort on the product of the date diff (like calculating the delta at insert/update time). As is, this seems to be one of those cases where a scan is actually the optimal retrieval method.

You said that this query was no fun, but if you look closer, this is by far the best one (and would be even better if you left out the compute scalar output):

    SELECT
        * ,
        DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2])
    FROM
        [#sargme] AS [s]
    WHERE
        [s].[DateCol2] >= DATEADD(DAY, 48, [s].[DateCol1])

The reason is that avoiding the `DATEDIFF` potentially shaves some CPU compared to a calculation against *only* the non-leading key column in the index, and also avoids some nasty implicit conversions to `datetimeoffset(7)` (don't ask me why those are there, but they are). Here is the `DATEDIFF` version:

``` none
<Predicate>
<ScalarOperator ScalarString=
"datediff(day,CONVERT_IMPLICIT(datetimeoffset(7),
[splunge].[dbo].[sargme].[DateCol1] as [s].[DateCol1],0),
CONVERT_IMPLICIT(datetimeoffset(7),[splunge].[dbo].[sargme].[DateCol2] 
as [s].[DateCol2],0))>=(48)">
```

And here's the one without `DATEDIFF`:

``` none
<Predicate>
<ScalarOperator ScalarString=
"[splunge].[dbo].[sargme].[DateCol2] as [s].[DateCol2]>=
dateadd(day,(48),[splunge].[dbo].[sargme].[DateCol1] as [s].[DateCol1])">
```

Also, I found slightly better results in terms of duration when I changed the index to only *include* `DateCol2` (and when both indexes were present, SQL Server always chose the one with one key and one include column over the multi-key one). For this query, since we have to scan all rows to find the range anyway, there is no benefit to having the second date column as part of the key and sorted in any way. And while I know we can't get a seek here, there is something inherently good-feeling about *not* hindering the ability to get one by forcing calculations against the leading key column, and only performing them against secondary or included columns.
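
For reference, that narrower index would look something like this (the index name is illustrative):

```sql
-- DateCol1 stays the only key column; DateCol2 is carried at the
-- leaf level only, since its sort order can't help this predicate
CREATE NONCLUSTERED INDEX ix_dates_incl
ON #sargme (DateCol1)
INCLUDE (DateCol2);
```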

If it were me, and I gave up on finding the sargable solution, I know which one I would choose - the one that makes SQL Server do the least amount of work (even if the delta is almost nonexistent). Or better yet I would relax my restrictions about schema change and the like.

And how much does all of that matter? I don't know. I made the table 10 million rows and all of the above query variations still completed in under a second. And this is on a VM on a laptop (granted, with SSD).
Answer #3
daniel hutmacher (imported from SE)
Risking ridicule from some of the biggest names in the SQL Server community, I'm going to stick my neck out and say, nope.

In order for your query to be SARGable, you'd have to basically construct a query that can pinpoint a starting row in a *range of consecutive rows* in an index. With the index `ix_dates`, the rows are not ordered by the date difference between `DateCol1` and `DateCol2`, so your target rows could be spread out anywhere in the index.

Self-joins, multiple passes, etc. all have in common that they include at least one Index Scan, although a (nested loop) join may well use an Index Seek. But I can't see how it would be possible to eliminate the Scan.

As for getting more accurate row estimates, there are no statistics on the date difference.

The following, fairly ugly recursive CTE construct does technically eliminate scanning the whole table, although it introduces a Nested Loop Join and a (potentially very large) number of Index Seeks.

    DECLARE @from date, @count int;
    SELECT TOP 1 @from=DateCol1 FROM #sargme ORDER BY DateCol1;
    SELECT TOP 1 @count=DATEDIFF(day, @from, DateCol1) FROM #sargme WHERE DateCol1<=DATEADD(day, -48, {d '9999-12-31'}) ORDER BY DateCol1 DESC;
    
    WITH cte AS (
        SELECT 0 AS i UNION ALL
        SELECT i+1 FROM cte WHERE i<@count)
    
    SELECT b.*
    FROM cte AS a
    INNER JOIN #sargme AS b ON
        b.DateCol1=DATEADD(day, a.i, @from) AND
        b.DateCol2>=DATEADD(day, 48+a.i, @from)
    OPTION (MAXRECURSION 0);

It creates an Index Spool containing every `DateCol1` in the table, then performs an Index Seek (range scan) for each of those `DateCol1` and `DateCol2` that are at least 48 days forward.

More IOs, slightly longer execution time, row estimate is still way off, and zero chance of parallelization because of the recursion: I'm guessing this query could possibly be useful if you have a very large number of values within relatively few distinct, consecutive `DateCol1` (keeping the number of Seeks down).

[![Crazy recursive CTE query plan][1]][1]

  [1]: https://i.stack.imgur.com/ReyHi.png
Answer #4
revisions (imported from SE)
*Answer originally added by the question author as an edit to the question*

---

After letting this sit for a bit, and some really smart people chiming in, my initial thought on this seems correct: there's no sane and SARGable way to write this query without adding a column, either computed, or maintained via some other mechanism, namely triggers.

I did try a few other things, and I have some other observations that may or may not be interesting to anyone reading.

First, I re-ran the setup using a regular table rather than a temp table, for a couple of reasons:

 - Even though I know their reputation, I wanted to try multi-column statistics out. They were useless.
 - I wanted to see which statistics were used

Here's the new setup:

    USE [tempdb]
    SET NOCOUNT ON	
    
    DBCC FREEPROCCACHE
    
    IF OBJECT_ID('tempdb..sargme') IS NOT NULL
    BEGIN
    DROP TABLE sargme
    END
    
    SELECT TOP 1000
    IDENTITY (BIGINT, 1,1) AS ID,
    CAST(DATEADD(DAY, [m].[severity] * -1, GETDATE()) AS DATE) AS [DateCol1],
    CAST(DATEADD(DAY, [m].[severity], GETDATE()) AS DATE) AS [DateCol2]
    INTO sargme
    FROM sys.[messages] AS [m]
    
    ALTER TABLE [sargme] ADD CONSTRAINT [pk_whatever] PRIMARY KEY CLUSTERED ([ID])
    CREATE NONCLUSTERED INDEX [ix_dates] ON [sargme] ([DateCol1], [DateCol2])
    
    CREATE STATISTICS [s_sargme] ON [sargme] ([DateCol1], [DateCol2])

Then, running the first query, it uses the ix_dates index, and scans, just like before. No change here. This seems redundant, but stick with me.

    SELECT
        * ,
        DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2])
    FROM
        [sargme] AS [s]
    WHERE
        DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2]) >= 48

Run the CTE query again, still the same...

    WITH    [x] AS ( SELECT
                    * ,
                    DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2]) AS [ddif]
                   FROM
                    [sargme] AS [s])
         SELECT
            *
         FROM
            [x]
         WHERE
            [x].[ddif] >= 48;

Alright! Run the not-even-half-sargable query again:

    SELECT
        * ,
        DATEDIFF(DAY, [s].[DateCol1], [s].[DateCol2])
    FROM
        [sargme] AS [s]
    WHERE
        [s].[DateCol2] >= DATEADD(DAY, 48, [s].[DateCol1])

Now add the computed column, and re-run all three, along with the query that hits the computed column:

    ALTER TABLE [sargme] ADD [ddiff] AS 
    DATEDIFF(DAY, DateCol1, DateCol2) PERSISTED
    
    CREATE NONCLUSTERED INDEX [ix_dates2] ON [sargme] ([ddiff], [DateCol1], [DateCol2])
    
    SELECT [s].[ID] ,
           [s].[DateCol1] ,
           [s].[DateCol2]
    FROM [sargme] AS [s]
    WHERE [ddiff] >= 48

If you stuck with me to here, thanks. This is the interesting observation portion of the post.

Running the queries with an undocumented trace flag described by [Fabiano Amorim][3] to see which statistics each query used is pretty cool. Seeing that _no plan touched a statistics object_ until the computed column was created and indexed seemed odd.

[![What the bloodclot][4]][4]

Heck, even the query that hit the computed column ONLY didn't touch a statistics object until I ran it a few times and it got simple parameterization. So even though they all initially scanned the ix_dates index, they used hard-coded cardinality estimates (30% of the table) rather than any statistics object available to them. 

One other point that raised an eyebrow over here is that when I added only the nonclustered index, the query plans all scanned the HEAP, rather than using the nonclustered index on both date columns.

Thanks to everyone who responded. You're all wonderful.

  [3]: http://blogfabiano.com/2012/07/03/statistics-used-in-a-cached-query-plan/
  [4]: https://i.stack.imgur.com/80oNF.png