GROUP BY with MAX versus just MAX

sql-server add tag

Ofiris (imported from SE)

I am a programmer, dealing with a big table which the following scheme:

    UpdateTime, PK, datetime, notnull
    Name, PK, char(14), notnull
    TheData, float

There is a clustered index on `Name, UpdateTime`

I was wondering what should be faster:

    SELECT MAX(UpdateTime)
    FROM [MyTable]
or

    SELECT MAX([UpdateTime]) AS value
    from
       (
    	SELECT [UpdateTime]
        FROM [MyTable]
        group by [UpdateTime]
       ) as t

The inserts to this table are in chunks of 50,000 rows with **the same date**.
So I thought grouping by might ease the `MAX` calculation.

Instead of trying to find max of 150,000 rows, grouping by to 3 rows, and then calculation of `MAX` would be faster ? Is my assumption correct or group by is also costly ?

Top Answer

Craig Efrein (imported from SE)

I created the table big_table according to your schema

    create table big_table
    (
    	updatetime datetime not null,
    	name char(14) not null,
    	TheData float,
    	primary key(Name,updatetime)
    )

I then filled the table with 50,000 rows with this code:

    DECLARE @ROWNUM as bigint = 1
    WHILE(1=1)
    BEGIN
    	set @rownum  = @ROWNUM + 1
    	insert into big_table values(getdate(),'name' + cast(@rownum as CHAR), cast(@rownum as float))
    	if @ROWNUM > 50000
    		BREAK;	
    END

Using SSMS, I then tested both queries and realized that in the first query you are looking for the MAX of TheData and in the second, the MAX of updatetime

I thus modified the first query to also get the MAX of updatetime


    set statistics time on -- execution time
    set statistics io on -- io stats (how many pages read, temp tables)
    
    -- query 1
    SELECT MAX([UpdateTime])
    FROM big_table
    
    -- query 2
    SELECT MAX([UpdateTime]) AS value
    from
       (
        SELECT [UpdateTime]
        FROM big_table
        group by [UpdateTime]
       ) as t
       
       
    set statistics time off
    set statistics io off

Using [Statistics Time][1] I get back the number of milliseconds required to parse, compile, and execute each statement

Using [Statistics IO][2] I get back information about disk activity

STATISTICS TIME and STATISTICS IO provide useful information. Such as were temporary tables used (indicated by worktable). Also how many logical pages read were read which indicates the number of database pages read from cache.

I then activate Execution plan with CTRL+M (activates show actual execution plan) and then execute with F5.

This will provide a comparison of both queries.

Here is the output of the **Messages Tab**

**-- Query 1**

Table 'big_table'. Scan count 1, *logical reads 543*, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   *CPU time = 16 ms,  elapsed time = 6 ms*.

**-- Query 2**

Table '*Worktable*'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'big_table'. Scan count 1, *logical reads 543*, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   *CPU time = 0 ms,  elapsed time = 35 ms*.

Both queries result in 543 logical reads, but the second query has an elapsed time of 35ms where as the first has only 6ms. You will also notice that the second query results in the use of temporary tables in tempdb, indicated by the word *worktable*. Even though all the values for worktable are at 0, work was still done in tempdb. 

Then there is the output from the actual **Execution plan** tab next to the Messages tab

![enter image description here][3]

According to the execution plan provided by MSSQL, the second query you provided has a total batch cost of 64% whereas the first only costs 36% of the total batch, so the first query requires less work.

Using SSMS, you can test and compare your queries and find out exactly how MSSQL is parsing your queries and what objects: tables, indexes and/or stats if any are being used to satisfy those queries. 

One additional side note to keep in mind when testing is cleaning out the cache before testing, if possible.  This helps to ensure that comparisons are accurate and this is important when thinking about disk activity. I start off with [DBCC DROPCLEANBUFFERS][5] and [DBCC FREEPROCCACHE][4] to clear out all the cache. **Be careful though not to use these commands on a production server actually in use** as you will effectively force the server to read everything from disk into memory. 

Here is the relevant documentation. 

1. Clear the plan cache with [DBCC FREEPROCCACHE][4]
2. Clear out everything from the buffer pool with [DBCC DROPCLEANBUFFERS][5]

Using these commands may not be possible depending on how your environment is used. 

**Updated 10/28 12:46pm**

Made corrections to the execution plan image and statistics output.


  [1]: http://technet.microsoft.com/en-us/library/ms190287.aspx
  [2]: http://technet.microsoft.com/en-us/library/ms184361.aspx
  [3]: https://i.stack.imgur.com/jMEdj.png
  [4]: http://technet.microsoft.com/en-us/library/ms174283.aspx
  [5]: http://technet.microsoft.com/en-us/library/ms187762.aspx

Answer #2

Paul White (imported from SE)

>*The inserts to this table are in chunks of 50,000 rows with the same date. So I thought grouping by might ease the MAX calculation.*

The rewrite might have helped if SQL Server implemented index skip-scan, but it does not.

Index skip-scan allows a database engine to seek to the next different index value instead of scanning all the duplicates (or irrelevant sub-keys) in between. In your case, skip-scan would allow the engine to find the `MAX(UpdateTime)` for the first `Name`, skip to the `MAX(UpdateTime)` for the second `Name`...and so on. The final step would be to find the `MAX(UpdateTime)` from the one-per-name candidates.

You can simulate this to some extent using a recursive CTE, but it is a bit messy, and not as efficient as built-in skip-scan would be:

    WITH RecursiveCTE
    AS
    (
        -- Anchor: MAX UpdateTime for
        -- highest-sorting Name
        SELECT TOP (1)
            BT.Name,
            BT.UpdateTime
        FROM dbo.BigTable AS BT
        ORDER BY
            BT.Name DESC,
            BT.UpdateTime DESC
    
        UNION ALL
    
        -- Recursive part
        -- MAX UpdateTime for Name
        -- that sorts immediately lower
        SELECT
            SubQuery.Name,
            SubQuery.UpdateTime
        FROM 
        (
            SELECT
                BT.Name,
                BT.UpdateTime,
                rn = ROW_NUMBER() OVER (
                    ORDER BY BT.Name DESC, BT.UpdateTime DESC)
            FROM RecursiveCTE AS R
            JOIN dbo.BigTable AS BT
                ON BT.Name < R.Name
        ) AS SubQuery
        WHERE
            SubQuery.rn = 1
    )
    -- Final MAX aggregate over
    -- MAX(UpdateTime) per Name
    SELECT MAX(UpdateTime) 
    FROM RecursiveCTE
    OPTION (MAXRECURSION 0);

![Recursive CTE plan][1]

That plan performs a singleton seek for each distinct `Name`, then finds the highest `UpdateTime` from the candidates. It's performance relative to a simple full scan of the table depends on how many duplicates there are per `Name`, and whether the pages touched by the singleton seeks are in memory or not.

### Alternative solutions

If you are able to create a new index on this table, a good choice for this query would be an index on `UpdateTime` alone:

    CREATE INDEX IX__BigTable_UpdateTime 
    ON dbo.BigTable (UpdateTime);

This index will allow the execution engine to find the highest `UpdateTime` with a singleton seek to the end of the index b-tree:

![New index plan][2]

This plan consumes just a few logical IOs (to navigate the b-tree levels) and completes immediately. Note that the Index Scan in the plan is not a full scan of the new index - it merely returns one row from the one 'end' of the index.

If you do not want to create a complete new index on the table, you might consider an indexed view containing just the unique `UpdateTime` values:

    CREATE VIEW dbo.BigTableUpdateTimes
    WITH SCHEMABINDING AS
    SELECT 
        UpdateTime, 
        NumRows = COUNT_BIG(*)
    FROM dbo.BigTable AS BT
    GROUP BY
        UpdateTime;
    GO
    CREATE UNIQUE CLUSTERED INDEX cuq
    ON dbo.BigTableUpdateTimes (UpdateTime);

This has the advantage of only creating a structure with as many rows as there are unique `UpdateTime` values, though every query that changes data in the base table will have extra operators added to its execution plan to maintain the indexed view. The query to find the maximum `UpdateTime` value would be:

    SELECT MAX(BTUT.UpdateTime)
    FROM dbo.BigTableUpdateTimes AS BTUT
        WITH (NOEXPAND);

![Indexed view plan][3]

  [1]: https://i.stack.imgur.com/lwfpN.png
  [2]: https://i.stack.imgur.com/G6JSA.png
  [3]: https://i.stack.imgur.com/2IacO.png

2 Answers