sql-server sql-server-2014
cory nelson (imported from SE)
I have a table like this:

	CREATE TABLE Updates
	(
		UpdateId INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
		ObjectId INT NOT NULL
	)

Essentially tracking updates to objects with an increasing ID.

The consumer of this table will select a chunk of 100 distinct object IDs, ordered by `UpdateId` and starting from a specific `UpdateId`. Essentially, it keeps track of where it left off and then queries for any new updates.

I've found this to be an interesting optimization problem, because I've only been able to generate an optimal query plan by writing queries that *happen* to do what I want thanks to the indexes, but that do not *guarantee* what I want:

	SELECT DISTINCT TOP 100 ObjectId
	FROM Updates
	WHERE UpdateId > @fromUpdateId

Where `@fromUpdateId` is a stored procedure parameter.

With a plan of:

	SELECT <- TOP <- Hash match (flow distinct, 100 rows touched) <- Index seek

Because the seek on the `UpdateId` index is used, the results are already nicely ordered from lowest to highest update ID, as I want. This also generates a *flow distinct* plan, which is what I'm after. But the ordering obviously isn't guaranteed behavior, so I don't want to rely on it.

This trick also results in the same query plan (though with a redundant TOP):

	WITH ids AS
	(
		SELECT ObjectId
		FROM Updates
		WHERE UpdateId > @fromUpdateId
		ORDER BY UpdateId OFFSET 0 ROWS
	)
	SELECT DISTINCT TOP 100 ObjectId FROM ids

Though I'm not sure (and suspect not) whether this truly guarantees ordering.

One query I hoped SQL Server would be smart enough to simplify was this one, but it ends up generating a very bad query plan:

	SELECT TOP 100 ObjectId
	FROM Updates
	WHERE UpdateId > @fromUpdateId
	GROUP BY ObjectId
	ORDER BY MIN(UpdateId)

With a plan of:

	SELECT <- Top N Sort <- Hash Match aggregate (50,000+ rows touched) <- Index Seek

I'm trying to find a way to generate an optimal plan with an index seek on `UpdateId` and a *flow distinct* to remove duplicate `ObjectId`s. Any ideas?

[Sample data](https://gist.github.com/anonymous/a05400310a127188b949aa25990d1c17) if you want it. Objects will rarely have more than one update, and should almost never have more than one within a set of 100 rows, which is why I'm after a *flow distinct*, unless there's something better that I don't know of. However, there is no guarantee that a single `ObjectId` won't have more than 100 rows in the table. The table has over 1,000,000 rows and is expected to grow rapidly.

Assume the user of this has another way to find the appropriate next `@fromUpdateId`. No need to return it in this query.
Top Answer
Paul White
The SQL Server optimizer **cannot** produce the execution plan you are after with the guarantee you need, because the *Hash Match Flow Distinct* operator is not order-preserving.

> Though, I'm not sure (and suspect not) if this truly guarantees ordering.

You may *observe* order preservation in many cases, but this is an implementation detail; there is no guarantee, so you cannot rely on it. As always, presentation order can only be guaranteed by a top-level `ORDER BY` clause.
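
As an aside, a top-level `ORDER BY` runs into a restriction here that is worth spelling out. A sketch (my illustration, reusing the question's table and `@fromUpdateId` parameter): with `SELECT DISTINCT`, the `ORDER BY` expressions must appear in the select list, which is exactly why you cannot simply ask for distinct `ObjectId` values ordered by `UpdateId`:

```sql
-- Legal: ordering by an expression in the SELECT DISTINCT list.
SELECT DISTINCT TOP (100)
    U.ObjectId
FROM dbo.Updates AS U
WHERE U.UpdateId > @fromUpdateId
ORDER BY U.ObjectId;        -- guaranteed order, but by ObjectId

-- Not legal: UpdateId is not in the DISTINCT select list.
-- SELECT DISTINCT TOP (100) U.ObjectId
-- FROM dbo.Updates AS U
-- WHERE U.UpdateId > @fromUpdateId
-- ORDER BY U.UpdateId;     -- raises error 145
```

So the ordering the question wants (by first-encountered `UpdateId`) cannot be expressed with a plain `DISTINCT` plus `ORDER BY` at all; it needs `GROUP BY ... ORDER BY MIN(UpdateId)` or one of the alternatives below.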

### Example

The script below shows that Hash Match Flow Distinct does not preserve order. It sets up the table in question with matching numbers 1-50,000 in both columns:

    IF OBJECT_ID(N'dbo.Updates', N'U') IS NOT NULL
        DROP TABLE dbo.Updates;
    GO
    CREATE TABLE dbo.Updates
    (
        UpdateId INT NOT NULL IDENTITY(1,1),
        ObjectId INT NOT NULL,
    
        CONSTRAINT PK_Updates_UpdateId PRIMARY KEY (UpdateId)
    );
    GO
    INSERT dbo.Updates (ObjectId)
    SELECT TOP (50000)
        ObjectId =
            ROW_NUMBER() OVER (
                ORDER BY C1.[object_id]) 
    FROM sys.columns AS C1
    CROSS JOIN sys.columns AS C2
    ORDER BY
        ObjectId;

The test query is:

    DECLARE @Rows bigint = 50000;
    
    -- Optimized for 1 row, but will be 50,000 when executed
    SELECT DISTINCT TOP (@Rows)
        U.ObjectId 
    FROM dbo.Updates AS U
    WHERE 
        U.UpdateId > 0
    OPTION (OPTIMIZE FOR (@Rows = 1));

The estimated plan shows an index seek and flow distinct:

[![Estimated plan][2]][2]

The output certainly seems ordered to start with:

[![Start of results][3]][3]

...but further down values start to go 'missing':

[![Pattern breaking down][4]][4]

...and eventually:

[![Chaos breaks out][5]][5]

The explanation in this particular case is that the hash operator spills:

[![Post-execution plan][6]][6]

Once a partition spills, all rows that hash to the same partition also spill. Spilled partitions are processed later, breaking the expectation that distinct values encountered will be emitted immediately in the sequence they are received.

---

There are many ways to write an efficient query to produce the ordered result you want, such as using recursion or a cursor. However, it cannot be done with *Hash Match Flow Distinct*.
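
For example, a minimal sketch of the cursor approach (my illustration, not tuned for performance; it reuses the question's table and assumes a `@fromUpdateId` value):

```sql
-- Walk the UpdateId index in order, keeping the first 100 distinct
-- ObjectId values in the order they are first encountered.
DECLARE @fromUpdateId int = 0;      -- would be the procedure parameter
DECLARE @found int = 0, @ObjectId int;

DECLARE @Results table
(
    Seq int IDENTITY PRIMARY KEY,   -- preserves first-seen order
    ObjectId int NOT NULL UNIQUE
);

DECLARE Walker CURSOR LOCAL FAST_FORWARD FOR
    SELECT U.ObjectId
    FROM dbo.Updates AS U
    WHERE U.UpdateId > @fromUpdateId
    ORDER BY U.UpdateId;            -- ordered seek on the primary key

OPEN Walker;
FETCH NEXT FROM Walker INTO @ObjectId;

WHILE @@FETCH_STATUS = 0 AND @found < 100
BEGIN
    IF NOT EXISTS (SELECT 1 FROM @Results AS R WHERE R.ObjectId = @ObjectId)
    BEGIN
        INSERT @Results (ObjectId) VALUES (@ObjectId);
        SET @found += 1;
    END;

    FETCH NEXT FROM Walker INTO @ObjectId;
END;

CLOSE Walker;
DEALLOCATE Walker;

SELECT R.ObjectId
FROM @Results AS R
ORDER BY R.Seq;
```

Because the cursor stops fetching as soon as 100 distinct values have been collected, it touches roughly as few rows as the flow-distinct plan would, while the final `ORDER BY R.Seq` makes the first-seen ordering an actual guarantee rather than an implementation detail.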

  [2]: https://i.stack.imgur.com/6M0va.png
  [3]: https://i.stack.imgur.com/uWTxU.png
  [4]: https://i.stack.imgur.com/Rx9u0.png
  [5]: https://i.stack.imgur.com/H42Q4.png
  [6]: https://i.stack.imgur.com/oV34e.png
Forcing Flow Distinct
