Why does my SELECT DISTINCT TOP N query scan the entire table?

sql-server sql-server-2016 add tag

Joe Obbish (imported from SE)

I've run into a few `SELECT DISTINCT TOP N` queries which appear to be poorly optimized by the SQL Server query optimizer. Let's start by considering a trivial example: a million row table with two alternating values. I'll use the [GetNums][1] function to generate the data:

    DROP TABLE IF EXISTS X_2_DISTINCT_VALUES;
    
    CREATE TABLE X_2_DISTINCT_VALUES (PK INT IDENTITY (1, 1), VAL INT NOT NULL);
    
    INSERT INTO X_2_DISTINCT_VALUES WITH (TABLOCK) (VAL)
    SELECT N % 2
    FROM dbo.GetNums(1000000);
    
    UPDATE STATISTICS X_2_DISTINCT_VALUES WITH FULLSCAN;


For the following query:

    SELECT DISTINCT TOP 2 VAL
    FROM X_2_DISTINCT_VALUES
    OPTION (MAXDOP 1);

SQL Server can find two distinct values just by scanning the first data page of the table but it [scans all of the data instead][2]. Why doesn't SQL Server just scan until it finds the requested number of distinct values?

For this question please use the following test data that contains 10 million rows with 10 distinct values generated in blocks:

    DROP TABLE IF EXISTS X_10_DISTINCT_HEAP;
    
    CREATE TABLE X_10_DISTINCT_HEAP (VAL VARCHAR(10) NOT NULL);
    
    INSERT INTO X_10_DISTINCT_HEAP WITH (TABLOCK)
    SELECT REPLICATE(CHAR(65 + (N / 100000 ) % 10 ), 10)
    FROM dbo.GetNums(10000000);
    
    UPDATE STATISTICS X_10_DISTINCT_HEAP WITH FULLSCAN;


Answers for a table with a clustered index are also acceptable:

    DROP TABLE IF EXISTS X_10_DISTINCT_CI;
    
    CREATE TABLE X_10_DISTINCT_CI (PK INT IDENTITY (1, 1), VAL VARCHAR(10) NOT NULL, PRIMARY KEY (PK));
    
    INSERT INTO X_10_DISTINCT_CI WITH (TABLOCK) (VAL)
    SELECT REPLICATE(CHAR(65 + (N / 100000 ) % 10 ), 10)
    FROM dbo.GetNums(10000000);
    
    UPDATE STATISTICS X_10_DISTINCT_CI WITH FULLSCAN;


The following query [scans all 10 million rows from the table][3]. How can I get something that doesn't scan the entire table? I'm using SQL Server 2016 SP1.

    SELECT DISTINCT TOP 10 VAL
    FROM X_10_DISTINCT_HEAP
    OPTION (MAXDOP 1);


  [1]: http://tsql.solidq.com/SourceCodes/GetNums.txt
  [2]: https://www.brentozar.com/pastetheplan/?id=SyP3qAvKg
  [3]: https://www.brentozar.com/pastetheplan/?id=Hy113Avtg

Top Answer

Joe Obbish (imported from SE)

There look to be three different optimizer rules that can perform the `DISTINCT` operation in the above query. The following query throws an error which suggests that the list is exhaustive:

    SELECT DISTINCT TOP 10 ID
    FROM X_10_DISTINCT_HEAP
    OPTION (MAXDOP 1, QUERYRULEOFF GbAggToSort, QUERYRULEOFF GbAggToHS, QUERYRULEOFF GbAggToStrm);

> Msg 8622, Level 16, State 1, Line 1

> Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

`GbAggToSort` implements the group-by aggregate (distinct) as a distinct sort. This is a blocking operator that will read all of the data from the input before producing any rows. `GbAggToStrm` implements the group-by aggregate as a stream aggregate (which also requires an input sort in this instance). This is also a blocking operator. `GbAggToHS` implements as a hash match, which is what we saw in the bad plan from the question, but it can be implemented as hash match (aggregate) or hash match (flow distinct).

The hash match ([flow distinct][1]) operator is one way to solve this problem because it is not blocking. SQL Server should be able to stop the scan once it finds enough distinct values.

> The Flow Distinct logical operator scans the input, removing duplicates. Whereas the Distinct operator consumes all input before producing any output, the Flow Distinct operator returns each row as it is obtained from the input (unless that row is a duplicate, in which case it is discarded).

Why does the query in the question use hash match (aggregate) instead of hash match (flow distinct)? As the number of distinct values changes in the table I would expect the cost of the hash match (flow distinct) query to decrease because the estimate of the number of rows it needs to scan to the table should decrease. I would expect the cost of the hash match (aggregate) plan to increase because the hash table that it needs to build will get bigger. One way to investigate this is by [creating a plan guide][2]. If I create two copies of the data but apply a plan guide to one of them I should be able to compare hash match (aggregate) to hash match (distinct) side by side against the same data. Note that I can't do this by disabling query optimizer rules because the same rule applies to both plans (`GbAggToHS`).

Here's one way to get the plan guide that I'm after:

    DROP TABLE IF EXISTS X_PLAN_GUIDE_TARGET;
    
    CREATE TABLE X_PLAN_GUIDE_TARGET (VAL VARCHAR(10) NOT NULL);
    
    INSERT INTO X_PLAN_GUIDE_TARGET WITH (TABLOCK)
    SELECT CAST(N % 10000 AS VARCHAR(10))
    FROM dbo.GetNums(10000000);
    
    UPDATE STATISTICS X_PLAN_GUIDE_TARGET WITH FULLSCAN;
    
    -- run this query
    SELECT DISTINCT TOP 10 VAL  FROM X_PLAN_GUIDE_TARGET  OPTION (MAXDOP 1)

Get the plan handle and use it to create a plan guide:

    -- plan handle is 0x060007009014BC025097E88F6C01000001000000000000000000000000000000000000000000000000000000
    SELECT qs.plan_handle, st.text FROM 
    sys.dm_exec_query_stats AS qs   
    CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st  
    WHERE st.text LIKE '%X[_]PLAN[_]GUIDE[_]TARGET%'
    ORDER BY last_execution_time DESC;
    
    EXEC sp_create_plan_guide_from_handle 
    'EVIL_PLAN_GUIDE', 
    0x060007009014BC025097E88F6C01000001000000000000000000000000000000000000000000000000000000;

Plan guides only work on the exact query text, so let's copy it back from the plan guide:

    SELECT query_text
    FROM sys.plan_guides
    WHERE name = 'EVIL_PLAN_GUIDE';

Reset the data:

    TRUNCATE TABLE X_PLAN_GUIDE_TARGET;
    
    INSERT INTO X_PLAN_GUIDE_TARGET WITH (TABLOCK)
    SELECT REPLICATE(CHAR(65 + (N / 100000 ) % 10 ), 10)
    FROM dbo.GetNums(10000000);


Get a [query plan][3] for the query with the plan guide applied:

    SELECT DISTINCT TOP 10 VAL  FROM X_PLAN_GUIDE_TARGET  OPTION (MAXDOP 1)

This has the hash match (flow distinct) operator that we wanted with our test data. Note that SQL Server expects to read all of the rows from the table and that the estimated cost is the exact same as for the plan with the hash match (aggregate). The testing that I did suggested that the costs for the two plans are identical when the row goal for the plan is greater than or equal to the number of distinct values SQL Server expects from the table, which in this case can be simply derived from the statistics. Unfortunately (for our query) the optimizer picks the hash match (aggregate) over hash match (flow distinct) when the costs are the same. So we're 0.0000001 magic optimizer units away from the plan that we want.

One way to attack this problem is by decreasing the row goal. If the row goal from the optimizer's point of view is less than the distinct count of rows we'll probably get hash match (flow distinct). This can be accomplished with the `OPTIMIZE FOR` query hint:

    DECLARE @j INT = 10;
    
    SELECT DISTINCT TOP (@j) VAL
    FROM X_10_DISTINCT_HEAP
    OPTION (MAXDOP 1, OPTIMIZE FOR (@j = 1));

For this query the optimizer creates a plan as if the query just needs the first row but when the query is executed it gets back the first 10 rows. On my machine this query scans 892800 rows from `X_10_DISTINCT_HEAP` and completes in 299 ms with 250 ms of CPU time and 2537 logical reads. 

Note that this technique will not work if the statistics report just one distinct value, which could happen for sampled statistics against skewed data. However, in that case it's unlikely that your data is packed densely enough to justify using techniques like this. You may not lose much by scanning all of the data in the table, especially if that can be done in parallel.

Another way to attack this problem is by inflating the number of estimated distinct values SQL Server expects to get from the base table. This was harder than expected. Applying a deterministic function cannot possibly increase the distinct count of results. If the query optimizer is aware of that mathematical fact (some testing suggests it is at least for our purposes) then applying deterministic functions (which [includes all string functions][4]) will not increase the estimated number of distinct rows.

Many of the nondeterministic functions did not work either, including the obvious choices of `NEWID()` and `RAND()`. However, `LAG()` does the trick for this query. The query optimizer expects 10 million distinct values against the `LAG` expression which will encourage a [hash match (flow distinct) plan][5]:

    SELECT DISTINCT TOP 10 LAG(VAL, 0) OVER (ORDER BY (SELECT NULL)) AS ID
    FROM X_10_DISTINCT_HEAP
    OPTION (MAXDOP 1);

On my machine this query scans 892800 rows from `X_10_DISTINCT_HEAP` and completes in 1165 ms with 1109 ms of CPU time and 2537 logical reads, so the `LAG()` adds quite a bit of relative overhead. @Paul White suggested to try batch mode processing for this query. On SQL Server 2016 we can get batch mode processing even with `MAXDOP 1`. One way to get batch mode processing for a rowstore table is to join to an empty CCI as follows:

    CREATE TABLE #X_DUMMY_CCI (ID INT NOT NULL);
    
    CREATE CLUSTERED COLUMNSTORE INDEX X_DUMMY_CCI ON #X_DUMMY_CCI;
    
    SELECT DISTINCT TOP 10 VAL
    FROM
    (
    	SELECT LAG(VAL, 1) OVER (ORDER BY (SELECT NULL)) AS VAL
    	FROM X_10_DISTINCT_HEAP
    	LEFT OUTER JOIN #X_DUMMY_CCI ON 1 = 0
    ) t
    WHERE t.VAL IS NOT NULL
    OPTION (MAXDOP 1);

That code results in [this query plan][6].

Paul pointed out that I had to change the query to use `LAG(..., 1)` because `LAG(..., 0)` doesn't appear to be eligible for the Window Aggregate optimization. This change reduced the elapsed time to 520 ms and the CPU time to 454 ms.

Note that the `LAG()` approach isn't the most stable one. If Microsoft changes the uniqueness assumption against the function then it may no longer work. It has a different estimate with the legacy CE. Also this type of optimization against a heap isn't necessary a good idea. If the table is rebuilt it's possible to end up in a worst case scenario in which almost all of the rows need to be read from the table.

Against a table with a unique column (such as the clustered index example in the question) we have better options. For example we can trick the optimizer by using a `SUBSTRING` expression that always returns an empty string. SQL Server does not think that the `SUBSTRING` will change the number of distinct values so if we apply it to a unique column, such as PK, then the estimated number of distinct rows is 10 million. This following [query][7] gets the hash match (flow distinct) operator:

    SELECT DISTINCT TOP 10 VAL + SUBSTRING(CAST(PK AS VARCHAR(10)), 11, 1)
    FROM X_10_DISTINCT_CI
    OPTION (MAXDOP 1);

On my machine this query scans 900000 rows from `X_10_DISTINCT_CI` and completes in 333 ms with 297 ms of CPU time and 3011 logical reads.

In summary, the query optimizer appears to assume that all rows will be read from the table for `SELECT DISTINCT TOP N` queries when `N` >= the number of estimated distinct rows from the table. The hash match (aggregate) operator may have the same cost as the hash match (flow distinct) operator but the optimizer always picks the aggregate operator. This can lead to unnecessary logical reads when enough distinct values are located near the start of the table scan. Two ways to trick the optimizer into using the hash match (flow distinct) operator are to lower the row goal using the `OPTIMIZE FOR` hint or to increase the estimated number of distinct rows using `LAG()` or `SUBSTRING` on a unique column.

  [1]: https://technet.microsoft.com/en-us/library/ms177473(v=sql.105).aspx
  [2]: https://msdn.microsoft.com/en-us/library/bb964726.aspx
  [3]: https://www.brentozar.com/pastetheplan/?id=ryPKry_te
  [4]: https://msdn.microsoft.com/en-us/library/ms178091.aspx
  [5]: https://www.brentozar.com/pastetheplan/?id=Hkb2p1utx
  [6]: https://www.brentozar.com/pastetheplan/?id=ByYOllOYg
  [7]: https://www.brentozar.com/pastetheplan/?id=SyFXbeute

Answer #2

Joe Obbish (imported from SE)

For completeness, another way to approach this problem is to use [OUTER APPLY][1]. We can add an `OUTER APPLY` operator for each distinct value that we need to find. This is similar in concept to ypercube's recursive approach, but effectively has the recursion written out by hand. One advantage is that we're able to use `TOP` in the derived tables instead of the `ROW_NUMBER()` workaround. One big disadvantage is the query text gets longer as `N` increases.

Here is one implementation for the query against the heap:

    SELECT VAL
    FROM (
    	SELECT t1.VAL VAL1, t2.VAL VAL2, t3.VAL VAL3, t4.VAL VAL4, t5.VAL VAL5, t6.VAL VAL6, t7.VAL VAL7, t8.VAL VAL8, t9.VAL VAL9, t10.VAL VAL10
    	FROM 
    	( 
    	SELECT TOP 1 VAL FROM X_10_DISTINCT_HEAP 
    	) t1
    	OUTER APPLY
    	( 
    	SELECT TOP 1 VAL FROM X_10_DISTINCT_HEAP t2 WHERE t2.VAL NOT IN (t1.VAL)
    	) t2
    	OUTER APPLY
    	( 
    	SELECT TOP 1 VAL FROM X_10_DISTINCT_HEAP t3 WHERE t3.VAL NOT IN (t1.VAL, t2.VAL)
    	) t3
    	OUTER APPLY
    	( 
    	SELECT TOP 1 VAL FROM X_10_DISTINCT_HEAP t4 WHERE t4.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL)
    	) t4
    	OUTER APPLY
    	( 
    	SELECT TOP 1 VAL FROM X_10_DISTINCT_HEAP t5 WHERE t5.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL)
    	) t5
    	OUTER APPLY
    	( 
    	SELECT TOP 1 VAL FROM X_10_DISTINCT_HEAP t6 WHERE t6.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL, t5.VAL)
    	) t6
    	OUTER APPLY
    	( 
    	SELECT TOP 1 VAL FROM X_10_DISTINCT_HEAP t7 WHERE t7.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL, t5.VAL, t6.VAL)
    	) t7
    	OUTER APPLY
    	( 
    	SELECT TOP 1 VAL FROM X_10_DISTINCT_HEAP t8 WHERE t8.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL, t5.VAL, t6.VAL, t7.VAL)
    	) t8
    	OUTER APPLY
    	( 
    	SELECT TOP 1 VAL FROM X_10_DISTINCT_HEAP t9 WHERE t9.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL, t5.VAL, t6.VAL, t7.VAL, t8.VAL)
    	) t9
    	OUTER APPLY
    	( 
    	SELECT TOP 1 VAL FROM X_10_DISTINCT_HEAP t10 WHERE t10.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL, t5.VAL, t6.VAL, t7.VAL, t8.VAL, t9.VAL)
    	) t10
    ) t
    UNPIVOT 
    (
      VAL FOR VALS IN (VAL1, VAL2, VAL3, VAL4, VAL5, VAL6, VAL7, VAL8, VAL9, VAL10)
    ) AS upvt;

[Here][2] is the actual query plan for the above query. On my machine this query completes in 713 ms with 625 ms of CPU time and 12605 logical reads. We get a new distinct value every 100k rows so I would expect this query to scan around 900000 * 10 * 0.5 = 4500000 rows. In theory this query should do five times the logical reads of this query from the other answer:

    DECLARE @j INT = 10;
    
    SELECT DISTINCT TOP (@j) VAL
    FROM X_10_DISTINCT_HEAP
    OPTION (MAXDOP 1, OPTIMIZE FOR (@j = 1));

That query did 2537 logical reads. 2537 * 5 = 12685 which is pretty close to 12605.

For the table with the clustered index we can do better. This is because we can pass in the last clustered key value into the derived table to avoid scanning the same rows twice. One implementation:

    SELECT VAL
    FROM (
    	SELECT t1.VAL VAL1, t2.VAL VAL2, t3.VAL VAL3, t4.VAL VAL4, t5.VAL VAL5, t6.VAL VAL6, t7.VAL VAL7, t8.VAL VAL8, t9.VAL VAL9, t10.VAL VAL10
    	FROM 
    	( 
    	SELECT TOP 1 PK, VAL FROM X_10_DISTINCT_CI 
    	) t1
    	OUTER APPLY
    	( 
    	SELECT TOP 1 PK, VAL FROM X_10_DISTINCT_CI t2 WHERE PK > t1.PK AND t2.VAL NOT IN (t1.VAL)
    	) t2
    	OUTER APPLY
    	( 
    	SELECT TOP 1 PK, VAL FROM X_10_DISTINCT_CI t3 WHERE PK > t2.PK AND t3.VAL NOT IN (t1.VAL, t2.VAL)
    	) t3
    	OUTER APPLY
    	( 
    	SELECT TOP 1 PK, VAL FROM X_10_DISTINCT_CI t4 WHERE PK > t3.PK AND t4.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL)
    	) t4
    	OUTER APPLY
    	( 
    	SELECT TOP 1 PK, VAL FROM X_10_DISTINCT_CI t5 WHERE PK > t4.PK AND t5.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL)
    	) t5
    	OUTER APPLY
    	( 
    	SELECT TOP 1 PK, VAL FROM X_10_DISTINCT_CI t6 WHERE PK > t5.PK AND t6.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL, t5.VAL)
    	) t6
    	OUTER APPLY
    	( 
    	SELECT TOP 1 PK, VAL FROM X_10_DISTINCT_CI t7 WHERE PK > t6.PK AND t7.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL, t5.VAL, t6.VAL)
    	) t7
    	OUTER APPLY
    	( 
    	SELECT TOP 1 PK, VAL FROM X_10_DISTINCT_CI t8 WHERE PK > t7.PK AND t8.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL, t5.VAL, t6.VAL, t7.VAL)
    	) t8
    	OUTER APPLY
    	( 
    	SELECT TOP 1 PK, VAL FROM X_10_DISTINCT_CI t9 WHERE PK > t8.PK AND t9.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL, t5.VAL, t6.VAL, t7.VAL, t8.VAL)
    	) t9
    	OUTER APPLY
    	( 
    	SELECT TOP 1 PK, VAL FROM X_10_DISTINCT_CI t10 WHERE PK > t9.PK AND t10.VAL NOT IN (t1.VAL, t2.VAL, t3.VAL, t4.VAL, t5.VAL, t6.VAL, t7.VAL, t8.VAL, t9.VAL)
    	) t10
    ) t
    UNPIVOT 
    (
      VAL FOR VALS IN (VAL1, VAL2, VAL3, VAL4, VAL5, VAL6, VAL7, VAL8, VAL9, VAL10)
    ) AS upvt;

[Here][3] is the actual query plan for the above query. On my machine this query completes in 154 ms with 140 ms of CPU time and 3203 logical reads. This seemed to run a bit faster than the `OPTIMIZE FOR` query against the clustered index table. I wasn't expecting that so I tried to measure the performance more carefully. My methodology was to run each query ten times without result sets and to look at the aggregate numbers from `sys.dm_exec_sessions` and `sys.dm_exec_session_wait_stats`. Session 56 was the `APPLY` query and session 63 was the `OPTIMIZE FOR` query.

Output of `sys.dm_exec_sessions`:

    ╔════════════╦══════════╦════════════════════╦═══════════════╗
    ║ session_id ║ cpu_time ║ total_elapsed_time ║ logical_reads ║
    ╠════════════╬══════════╬════════════════════╬═══════════════╣
    ║         56 ║     1360 ║               1373 ║         32030 ║
    ║         63 ║     2094 ║               2091 ║         30400 ║
    ╚════════════╩══════════╩════════════════════╩═══════════════╝

There appears to be a clear advantage in cpu_time and elapsed_time for the `APPLY` query.

Output of `sys.dm_exec_session_wait_stats`:

    ╔════════════╦════════════════════════════════╦═════════════════════╦══════════════╦══════════════════╦═════════════════════╗
    ║ session_id ║           wait_type            ║ waiting_tasks_count ║ wait_time_ms ║ max_wait_time_ms ║ signal_wait_time_ms ║
    ╠════════════╬════════════════════════════════╬═════════════════════╬══════════════╬══════════════════╬═════════════════════╣
    ║         56 ║ SOS_SCHEDULER_YIELD            ║                 340 ║            0 ║                0 ║                   0 ║
    ║         56 ║ MEMORY_ALLOCATION_EXT          ║                  38 ║            0 ║                0 ║                   0 ║
    ║         63 ║ SOS_SCHEDULER_YIELD            ║                 518 ║            0 ║                0 ║                   0 ║
    ║         63 ║ MEMORY_ALLOCATION_EXT          ║                  98 ║            0 ║                0 ║                   0 ║
    ║         63 ║ RESERVED_MEMORY_ALLOCATION_EXT ║                 400 ║            0 ║                0 ║                   0 ║
    ╚════════════╩════════════════════════════════╩═════════════════════╩══════════════╩══════════════════╩═════════════════════╝

The `OPTIMIZE FOR` query has an additional wait type, [RESERVED_MEMORY_ALLOCATION_EXT][4]. I don't exactly know what this means. It may just be a measurement of the overhead in the hash match (flow distinct) operator. In any case, perhaps it's not worth worrying about a difference of 70 ms in CPU time.


  [1]: https://technet.microsoft.com/en-us/library/ms175156(v=sql.105).aspx
  [2]: https://www.brentozar.com/pastetheplan/?id=ByUez8ttl
  [3]: https://www.brentozar.com/pastetheplan/?id=Bku9MUKFx
  [4]: https://www.sqlskills.com/help/waits/reserved_memory_allocation_ext/

Answer #3

ypercubeᵀᴹ (imported from SE)

Here is an attempt to emulate a repeated partial scan (similar to but not the same as a skip scan) using a recursive CTE. The aim - since we have no index on `(id)` - is to avoid sorts and multiple scans on the table.  

It does a few tricks to bypass some recursive CTE restrictions:

- No `TOP` allowed in the recursive part. We use a subquery and `ROW_NUMBER()` instead.
- We can't have multiple references to the constant part or use `LEFT JOIN` or use `NOT IN (SELECT id FROM cte)` from the recursive part. To bypass, we build a `VARCHAR` string that accumulates all the `id` values, similar to `STRING_AGG` or to hierarchyID and then compare with `LIKE`.

For a Heap (assuming the column is named `id`) [test-1 on rextester.com][1].

This - as tests have shown - doesn't avoid multiple scans but performs OK when different values are found in the first few pages. If however the values are not evenly distributed, it may do multiple scans on big parts of the table - which pf course results in poor performance. 

    WITH ct (id, found, list) AS
      ( SELECT TOP (1) id, 1, CAST('/' + id + '/' AS VARCHAR(MAX))
        FROM x_large_table_2
      UNION ALL
        SELECT y.ID, ct.found + 1, CAST(ct.list + y.id + '/' AS VARCHAR(MAX))
        FROM ct
          CROSS APPLY 
          ( SELECT x.id, 
                   rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            FROM x_large_table_2 AS x
            WHERE ct.list NOT LIKE '%/' + id + '/%'
          ) AS y
        WHERE ct.found < 3         -- the TOP (n) parameter here
          AND y.rn = 1
      )
    SELECT id FROM ct ;

and when the table is clustered (CI on `unique_key`), [test-2 on rextester.com][2]. 

This uses the clustered index (`WHERE x.unique_key > ct.unique_key`) to avoid multiple scans:

    WITH ct (unique_key, id, found, list) AS
      ( SELECT TOP (1) unique_key, id, 1, CAST(CONCAT('/',id, '/') AS VARCHAR(MAX))
        FROM x_large_table_2
      UNION ALL
        SELECT y.unique_key, y.ID, ct.found + 1, 
            CAST(CONCAT(ct.list, y.id, '/') AS VARCHAR(MAX))
        FROM ct
          CROSS APPLY 
          ( SELECT x.unique_key, x.id, 
                   rn = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
            FROM x_large_table_2 AS x
            WHERE x.unique_key > ct.unique_key
              AND ct.list NOT LIKE '%/' + id + '/%'
          ) AS y
        WHERE ct.found < 5       -- the TOP (n) parameter here
          AND y.rn = 1
      )
    -- SELECT * FROM ct ;        -- for debugging
    SELECT id FROM ct ;



  [1]: http://rextester.com/MTWT95936
 
  [2]: http://rextester.com/SZAQ33742

Answer #4

Paul White (imported from SE)

You've already answered your own questions correctly.

I just want to add an observation that the most efficient way is actually to scan the whole table - if it can be organized as a [columnstore 'heap'][1]:

    CREATE CLUSTERED COLUMNSTORE INDEX CCSI 
    ON dbo.X_10_DISTINCT_HEAP;

The simple query:

    SELECT DISTINCT TOP (10)
    	XDH.VAL 
    FROM dbo.X_10_DISTINCT_HEAP AS XDH
    OPTION (MAXDOP 1);

then gives:

[![Execution plan][2]][2]


> Table 'X_10_DISTINCT_HEAP'. Scan count 1,   
>  logical reads 0, physical reads 0, read-ahead reads 0,   
>  <b>lob logical reads 66</b>, lob physical reads 0, lob read-ahead reads 0.  
> Table 'X_10_DISTINCT_HEAP'. Segment reads 13, segment skipped 0.  
>   
>  SQL Server Execution Times:  
>    **CPU time = 0 ms,  elapsed time = 11 ms.**


Hash Match (Flow Distinct) cannot currently execute in batch mode. The methods that use this are much slower due to the (invisible) expensive transition from batch to row processing. For example:

    SET ROWCOUNT 10;
    
    SELECT DISTINCT 
    	XDH.VAL
    FROM dbo.X_10_DISTINCT_HEAP AS XDH
    OPTION (FAST 1);
    
    SET ROWCOUNT 0;

Gives:

[![Flow Distinct Execution Plan][3]][3]



> Table 'X_10_DISTINCT_HEAP'. Scan count 1,   
>  logical reads 0, physical reads 0, read-ahead reads 0,   
>  <b>lob logical reads 20</b>, lob physical reads 0, lob read-ahead reads 0.  
> Table 'X_10_DISTINCT_HEAP'. <b>Segment reads 4</b>, segment skipped 0.  
>   
>  SQL Server Execution Times:  
>    **CPU time = 640 ms,  elapsed time = 680 ms.**



This is slower than when the table is organized as a rowstore heap.


  [1]: https://techcommunity.microsoft.com/t5/sql-server/columnstore-index-differences-between-columnstore-index-vs-btree/ba-p/384763
  [2]: https://i.stack.imgur.com/BhPVa.png
  [3]: https://i.stack.imgur.com/3XWhz.png

Answer #5

paparazzo (imported from SE)

I think you have answer on why  
This may be a way to address it  
I know it looks messy but the execution plan said distinct top 2 was 84% of the cost  

    SELECT distinct top (2)  [enumID]
    FROM [ENRONbbb].[dbo].[docSVenum1]

    declare @table table (enumID tinyint);
    declare @enumID tinyint;
    set @enumID = (select top (1) [enumID] from [docSVenum1]);
    insert into @table values (@enumID);
    set @enumID = (select top (1) [enumID] from [docSVenum1] where [enumID] not in (select enumID from @table));
    insert into @table values (@enumID);
    set @enumID = (select top (1) [enumID] from [docSVenum1] where [enumID] not in (select enumID from @table));
    insert into @table values (@enumID);
    set @enumID = (select top (1) [enumID] from [docSVenum1] where [enumID] not in (select enumID from @table));
    insert into @table values (@enumID);
    set @enumID = (select top (1) [enumID] from [docSVenum1] where [enumID] not in (select enumID from @table));
    insert into @table values (@enumID);
    set @enumID = (select top (1) [enumID] from [docSVenum1] where [enumID] not in (select enumID from @table));
    insert into @table values (@enumID);
    set @enumID = (select top (1) [enumID] from [docSVenum1] where [enumID] not in (select enumID from @table));
    insert into @table values (@enumID);
    set @enumID = (select top (1) [enumID] from [docSVenum1] where [enumID] not in (select enumID from @table));
    insert into @table values (@enumID);
    set @enumID = (select top (1) [enumID] from [docSVenum1] where [enumID] not in (select enumID from @table));
    insert into @table values (@enumID);
    set @enumID = (select top (1) [enumID] from [docSVenum1] where [enumID] not in (select enumID from @table));
    insert into @table values (@enumID);
    select enumID from @table;

5 Answers