SQL Server splits A <> B into A < B OR A > B, yielding strange results if B is non-deterministic - Databases

TopAnswers Databases

Meta

Databases

TeX

Code Golf

APL

C++

.net

db<>fiddle

Java

*nix

PHP

PowerShell

Python

Rust

टेक्-मराठी

Typst

Web Client Dev

Web Server Dev

SQL Server splits A <> B into A < B OR A > B, yielding strange results if B is non-deterministic

sql-server add tag

heinzi (imported from SE)

We have encountered an interesting issue with SQL Server. Consider the following repro example:

    CREATE TABLE #test (s_guid uniqueidentifier PRIMARY KEY);
    INSERT INTO #test (s_guid) VALUES ('7E28EFF8-A80A-45E4-BFE0-C13989D69618');
    
    SELECT s_guid FROM #test
    WHERE s_guid = '7E28EFF8-A80A-45E4-BFE0-C13989D69618'
      AND s_guid <> NEWID();
    
    DROP TABLE #test;

[fiddle](http://rextester.com/REASM51008)

Please forget for a moment that the `s_guid <> NEWID()` condition seems entirely useless - this is just a minimal repro example. Since the probability of `NEWID()` matching some given constant value is extremely small, it should evaluate to TRUE every time.

But it doesn't. Running this query *usually* returns 1 row, but *sometimes* (quite frequently, more than 1 time out of 10) returns 0 rows. I have reproduced it with SQL Server 2008 on my system, and you can reproduce it on-line with the fiddle linked above (SQL Server 2014).

Looking at the execution plan reveals that the query analyzer apparently splits the condition into `s_guid < NEWID() OR s_guid > NEWID()`:

[![query plan screenshot][1]][1]

...which completely explains why it fails sometimes (if the first generated ID is smaller and the second one larger than the given ID).

Is SQL Server allowed to evaluate `A <> B` as `A < B OR A > B`, even if one of the expressions is non-deterministic? If yes, where is it documented? Or did we find a bug?

Interestingly, `AND NOT (s_guid = NEWID())` yields the same execution plan (and the same random result).

We found this issue when a developer wanted to optionally exclude a particular row and used:

    s_guid <> ISNULL(@someParameter, NEWID())

as a "shortcut" for: 

    (@someParameter IS NULL OR s_guid <> @someParameter)

I am looking for documentation and/or confirmation of a bug. The code is not all that relevant so workarounds are not required.

  [1]: https://i.stack.imgur.com/OwK4n.png

Top Answer

Paul White (imported from SE)

> Is SQL Server allowed to evaluate `A <> B` as `A < B OR A > B`, even if one of the expressions is non-deterministic?

This is a somewhat controversial point, and the answer is a qualified "yes".

The best discussion I am aware of was given in answer to Itzik Ben-Gan's Connect bug report [Bug with NEWID and Table Expressions][1], which was closed as won't fix. Connect has since been retired, so the link there is to a web archive. Sadly, a lot of useful material was lost (or made harder to find) by the demise of Connect. Anyway, the most useful quotes from Jim Hogg of Microsoft there are:

>This hits to the very heart of the issue - is optimization allowed to change a program's semantics? Ie: if a program yields certain answers, but runs slowly, is it legitimate for a Query Optimizer make that program run faster, yet also change the results given?
>
>Before shouting "NO!" (my own personal inclination too :-), consider: the good news is that, in 99% of cases, the answers ARE the same. So Query Optimization is a clear win. The bad news is that, if the query contains side-effecting code, then different plans CAN indeed yield different results. And NEWID() is one such side-effecting (non-deterministic) 'function' that exposes the difference. [Actually, if you experiment, you can devise others - for example, short-circuit evaluation of AND clauses: make the second clause throw an arithmetic divide-by-zero - different optimizations may execute that second clause BEFORE the first clause] This reflects Craig's explanation, elsewhere in this thread, that SqlServer does not guarantee when scalar operators are executed.
>
>So, we have a choice: if we want to guarantee a certain behavior in the presence of non-deterministic (side-effecting) code - so that results of JOINs, for example, follow the semantics of a nested-loop execution - then we can use appropriate OPTIONs to force that behavior - as UC points out. But the resulting code will run slow - that's the cost of, in effect, hobbling the Query Optimizer.
>
>All that said, we are moving the Query Optimizer in the direction of "as expected" behavior for NEWID() - trading off performance for "results as expected".

One example of the changing of behaviour in this regard over time is [NULLIF works incorrectly with non-deterministic functions such as RAND()][2]. There are also other similar cases using e.g. `COALESCE` with a subquery that can produce unexpected results, and which are also being addressed gradually.

Jim continues:

>Closing the loop . . . I've discussed this question with the Dev team. And eventually we have decided not to change current behavior, for the following reasons:
>
>1) The optimizer does not guarantee timing or number of executions of scalar functions. This is a long-estabilished tenet. It's the fundamental 'leeway' tha allows the optimizer enough freedom to gain significant improvements in query-plan execution.
>
>2) This "once-per-row behavior" is not a new issue, although it's not widely discussed. We started to tweak its behavior back in the Yukon release. But it's quite hard to pin down precisely, in all cases, exactly what it means! For example, does it a apply to interim rows calculated 'on the way' to the final result? - in which case it clearly depends on the plan chosen. Or does it apply only to the rows that will eventually appear in the completed result? - there's a nasty recursion going on here, as I'm sure you'll agree!
>
>3) As I mentioned earlier, we default to "optimize performance" - which is good for 99% of cases. The 1% of cases where it might change results are fairly easy to spot - side-effecting 'functions' such as NEWID - and easy to 'fix' (trading perf, as a consequence). This default to "optimize performance" again, is long-established, and accepted. (Yes, it's not the stance chosen by compilers for conventional programming languages, but so be it).
>
>So, our recommendations are:
>
>a) Avoid reliance on non-guaranteed timing and number-of-executions semantics.  
b) Avoid using NEWID() deep in table expressions.  
c) Use OPTION to force a particular behavior (trading perf)
>
>Hope this explanation helps clarify our reasons for closing this bug as "won't fix".

---

>Interestingly, `AND NOT (s_guid = NEWID())` yields the same execution plan

This is a consequence of normalization, which happens very early during query compilation. Both expressions compile to exactly the same normalized form, so the same execution plan is produced.

  [1]: https://web.archive.org/web/20160626085155/https://connect.microsoft.com/SQLServer/feedbackdetail/view/350485/bug-with-newid-and-table-expressions
  [2]: https://feedback.azure.com/forums/908035-sql-server/suggestions/32886796-nullif-works-incorrectly-with-non-deterministic-fu

1 Answer