Why are (seemingly) suitable indexes not used on a LEFT JOIN with OR

sql-server sql-server-2019 add tag

searle1986 (imported from SE)

I have the following [fairly meaningless, just for the purpose of demonstration] query in the Stack Overflow database:

    SELECT	*
    FROM	Users u
    		LEFT JOIN Comments c
    		    ON u.Id = c.UserId OR
                   u.Id = c.PostId
    WHERE	u.DisplayName = 'alex'

The only index on the `Users` table is a clustered index on ID.

The `Comments` table has the following Non-Clustered indexes as well as Clustered Index on ID:

    CREATE INDEX IX_UserID ON Comments
    (
    	UserID,
    	PostID
    )

    CREATE INDEX IX_PostID ON Comments
    (
    	PostID,
    	UserID
    )

The estimated plan for the query is [here][1]:

I can see the first thing the optimizer will do is perform a CI scan on the users table to filter only those users where `DisplayName = Alex`, effectively doing this:

    SELECT	*
    FROM	Users u
    WHERE	u.DisplayName = 'alex'
    ORDER BY Id

and retreiving results as such:

[![enter image description here][2]][2]

Then it will scan the comments CI and for every row, look to see if the row satisfies the predicate 

    u.Id = c.UserId OR u.Id = c.PostId

Despite the two indexes, this CI scan is performed.

Wouldn't it be more efficient if the optimizer did a separate seek on each of the indexes in the Comments table above and join them together?

If I visualise what that would look like, in the screenshot above we can see the first result of the Users CI scan is ID 420

I can visualize what the `IX_UserID` Index looks like using 

    SELECT		UserID,
    			PostID
    FROM		Comments
    ORDER BY	UserID,
    			PostID

so if I seek to the rows for user ID 420 as an index seek would:

[![enter image description here][3]][3]

for every row where `UserID = 420`, I can look if  `u.Id = c.UserId OR u.Id = c.PostId` of, course they all match the `u.Id = c.UserId` part of our predicate, 

So for the second part of our index seek, we can seek through our index `IX_PostID` which can be visualised as follows:

    SELECT		PostID,
    			UserID
    FROM		Comments
    ORDER BY	PostID,
    			UserID 

If I seek to Post ID 420 I can see nothing is there:

[![enter image description here][4]][4]

So we then go back to the results of the CI scan, move to the next row (userId 447) and repeat the process.

The behaviour I have described above is possible using in a `WHERE` clause:

    SELECT		UserID,
    			PostID
    FROM		Comments
    WHERE		UserID = 420 OR PostID = 420
    ORDER BY	UserID,
    			PostID

[Plan here][5]

My question therefore is, why isn't an `OR` condition in a `JOIN` clause able to perform an index seek on appropriate indexes?

  [1]:https://www.brentozar.com/pastetheplan/?id=H1L-yr_rU
  [2]: https://i.stack.imgur.com/hHUi3.png
  [3]: https://i.stack.imgur.com/23B4z.png
  [4]: https://i.stack.imgur.com/WMoyX.png
  [5]: https://www.brentozar.com/pastetheplan/?id=HJk1HmP48

Top Answer

Paul White

@@@ answer 958

The optimizer doesn't always consider [index union plans](https://docs.microsoft.com/en-nz/archive/blogs/craigfr/index-union) because they are not commonly applicable, can be expensive to generate, and can lead to an explosion of the search space.

The heuristics that prevent the optimizer generating an *index union* plan e.g. via exploration rule `LeftSideJNtoIdxLookup` can be disabled with undocumented trace flag 8726.

Most often it will be more convenient to use the `FORCESEEK` hint, but this trace flag can be used in development or test to see if a good multi-index plan could be generated.

A plan generating an unwanted *index union* can be prevented from doing so with trace flag 8727.

Answer #2

Josh Darnell (imported from SE)

Rather than focusing on how to improve a query like this, which is what the other answers are doing, I'm going to try to answer the question being asked: *why* doesn't the optimizer produce a plan like the one you've described (that scans the Users table, and then seeks into the two indexes on the Comments table).

Here's your original query again (note I'm using `MAXDOP 2` just to simulate what I saw in your execution plans):

```
SELECT  *
FROM    Users u
        LEFT JOIN Comments c
            ON u.Id = c.UserId OR
               u.Id = c.PostId
WHERE   u.DisplayName = 'alex'
OPTION (MAXDOP 2);
```

And the plan:

[![Screenshot of original left join plan][1]][1]

- Scan of `dbo.Users` with residual predicate to get just the "alex" users
- For each of those users, scan the `dbo.Comments` table and filter matches in the join operator
- **Estimated cost: 293.161 optimizer units**

One attempt to get the plan you want would be to try and *force* a seek on the `dbo.Comments` table:

```
SELECT  *
FROM    Users u
        LEFT JOIN Comments c WITH (FORCESEEK)
            ON u.Id = c.UserId OR
               u.Id = c.PostId
WHERE   u.DisplayName = 'alex'
OPTION (MAXDOP 2);
```

The plans looks like this:

[![Screenshot of left join plan with hints][2]][2]

- scan of the `dbo.Users` table (with a residual predicate to only get users named "alex"), 
- seek into each of the two indexes to get the requested Id values (which are unioned together)
- followed up by a key lookup to get the rest of the columns (since we selected *)
- **Estimated cost: 5.98731 optimizer units**

So the answer is that the optimizer is definitely *capable* of producing such a plan.  And it doesn't seem to be a cost-based decision (the seek plan looks much cheaper).

My best guess is that this is just some kind of limitation in the optimizer's exploration process - it doesn't seem to favor converting a left join with an `OR` clause into an apply.   This is *really* unfortunate in this particular case, as performance is dismal in the scan plan (the query takes 45 seconds on my machine)  vs the apply plan (less than 1 second).

As [Rob Farley helpfully points out](https://topanswers.xyz/databases?q=815#a968), using `APPLY` directly (potentially with a `UNION` as well) is a better approach to get the plan you're looking for - both of those produce the "better" version of this plan (the `FORCESEEK` version).  I would say that "`OR` in an outer `JOIN`" is kind of a known anti-pattern, and should be avoided since it doesn't seem like the optimizer has great support for that type of query directly.

  [1]: https://i.stack.imgur.com/FQApZ.png
  [2]: https://i.stack.imgur.com/k1RaY.png

Answer #3

rob farley (imported from SE)

When you have a join, the Query Optimizer will consider how best to satisfy the predicates involved with the various join techniques. It doesn’t try to re-evaluate the query as if had been written with APPLY, which is what you kinda want here, where it would see the right hand side of the join as like a sub-query.

You can try this youself, by doing something like:

    SELECT  *
    FROM    Users u
        OUTER APPLY (
            SELECT *
            FROM Comments c
            WHERE u.Id = c.UserId 
            OR u.Id = c.PostId
        ) c
    WHERE   u.DisplayName = 'alex'

...but given than OR is annoyingly often not turned into a UNION, I’d prefer:

    SELECT  *
    FROM    Users u
        OUTER APPLY (
            SELECT *
            FROM Comments c
            WHERE u.Id = c.UserId 
            UNION
            SELECT *
            FROM Comments c
            WHERE u.Id = c.PostId
        ) c
    WHERE   u.DisplayName = 'alex'

3 Answers