I am a programmer, dealing with a big table which the following scheme: UpdateTime, PK, datetime, notnull Name, PK, char(14), notnull TheData, float There is a clustered index on `Name, UpdateTime` I was wondering what should be faster: SELECT MAX(UpdateTime) FROM [MyTable] or SELECT MAX([UpdateTime]) AS value from ( SELECT [UpdateTime] FROM [MyTable] group by [UpdateTime] ) as t The inserts to this table are in chunks of 50,000 rows with **the same date**. So I thought grouping by might ease the `MAX` calculation. Instead of trying to find max of 150,000 rows, grouping by to 3 rows, and then calculation of `MAX` would be faster ? Is my assumption correct or group by is also costly ?
Craig Efrein (imported from SE)
I created the table big_table according to your schema create table big_table ( updatetime datetime not null, name char(14) not null, TheData float, primary key(Name,updatetime) ) I then filled the table with 50,000 rows with this code: DECLARE @ROWNUM as bigint = 1 WHILE(1=1) BEGIN set @rownum = @ROWNUM + 1 insert into big_table values(getdate(),'name' + cast(@rownum as CHAR), cast(@rownum as float)) if @ROWNUM > 50000 BREAK; END Using SSMS, I then tested both queries and realized that in the first query you are looking for the MAX of TheData and in the second, the MAX of updatetime I thus modified the first query to also get the MAX of updatetime set statistics time on -- execution time set statistics io on -- io stats (how many pages read, temp tables) -- query 1 SELECT MAX([UpdateTime]) FROM big_table -- query 2 SELECT MAX([UpdateTime]) AS value from ( SELECT [UpdateTime] FROM big_table group by [UpdateTime] ) as t set statistics time off set statistics io off Using [Statistics Time] I get back the number of milliseconds required to parse, compile, and execute each statement Using [Statistics IO] I get back information about disk activity STATISTICS TIME and STATISTICS IO provide useful information. Such as were temporary tables used (indicated by worktable). Also how many logical pages read were read which indicates the number of database pages read from cache. I then activate Execution plan with CTRL+M (activates show actual execution plan) and then execute with F5. This will provide a comparison of both queries. Here is the output of the **Messages Tab** **-- Query 1** Table 'big_table'. Scan count 1, *logical reads 543*, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: *CPU time = 16 ms, elapsed time = 6 ms*. **-- Query 2** Table '*Worktable*'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'big_table'. Scan count 1, *logical reads 543*, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times: *CPU time = 0 ms, elapsed time = 35 ms*. Both queries result in 543 logical reads, but the second query has an elapsed time of 35ms where as the first has only 6ms. You will also notice that the second query results in the use of temporary tables in tempdb, indicated by the word *worktable*. Even though all the values for worktable are at 0, work was still done in tempdb. Then there is the output from the actual **Execution plan** tab next to the Messages tab ![enter image description here] According to the execution plan provided by MSSQL, the second query you provided has a total batch cost of 64% whereas the first only costs 36% of the total batch, so the first query requires less work. Using SSMS, you can test and compare your queries and find out exactly how MSSQL is parsing your queries and what objects: tables, indexes and/or stats if any are being used to satisfy those queries. One additional side note to keep in mind when testing is cleaning out the cache before testing, if possible. This helps to ensure that comparisons are accurate and this is important when thinking about disk activity. I start off with [DBCC DROPCLEANBUFFERS] and [DBCC FREEPROCCACHE] to clear out all the cache. **Be careful though not to use these commands on a production server actually in use** as you will effectively force the server to read everything from disk into memory. Here is the relevant documentation. 1. Clear the plan cache with [DBCC FREEPROCCACHE] 2. Clear out everything from the buffer pool with [DBCC DROPCLEANBUFFERS] Using these commands may not be possible depending on how your environment is used. **Updated 10/28 12:46pm** Made corrections to the execution plan image and statistics output. : http://technet.microsoft.com/en-us/library/ms190287.aspx : http://technet.microsoft.com/en-us/library/ms184361.aspx : https://i.stack.imgur.com/jMEdj.png : http://technet.microsoft.com/en-us/library/ms174283.aspx : http://technet.microsoft.com/en-us/library/ms187762.aspx
Paul White (imported from SE)
>*The inserts to this table are in chunks of 50,000 rows with the same date. So I thought grouping by might ease the MAX calculation.* The rewrite might have helped if SQL Server implemented index skip-scan, but it does not. Index skip-scan allows a database engine to seek to the next different index value instead of scanning all the duplicates (or irrelevant sub-keys) in between. In your case, skip-scan would allow the engine to find the `MAX(UpdateTime)` for the first `Name`, skip to the `MAX(UpdateTime)` for the second `Name`...and so on. The final step would be to find the `MAX(UpdateTime)` from the one-per-name candidates. You can simulate this to some extent using a recursive CTE, but it is a bit messy, and not as efficient as built-in skip-scan would be: WITH RecursiveCTE AS ( -- Anchor: MAX UpdateTime for -- highest-sorting Name SELECT TOP (1) BT.Name, BT.UpdateTime FROM dbo.BigTable AS BT ORDER BY BT.Name DESC, BT.UpdateTime DESC UNION ALL -- Recursive part -- MAX UpdateTime for Name -- that sorts immediately lower SELECT SubQuery.Name, SubQuery.UpdateTime FROM ( SELECT BT.Name, BT.UpdateTime, rn = ROW_NUMBER() OVER ( ORDER BY BT.Name DESC, BT.UpdateTime DESC) FROM RecursiveCTE AS R JOIN dbo.BigTable AS BT ON BT.Name < R.Name ) AS SubQuery WHERE SubQuery.rn = 1 ) -- Final MAX aggregate over -- MAX(UpdateTime) per Name SELECT MAX(UpdateTime) FROM RecursiveCTE OPTION (MAXRECURSION 0); ![Recursive CTE plan] That plan performs a singleton seek for each distinct `Name`, then finds the highest `UpdateTime` from the candidates. It's performance relative to a simple full scan of the table depends on how many duplicates there are per `Name`, and whether the pages touched by the singleton seeks are in memory or not. ### Alternative solutions If you are able to create a new index on this table, a good choice for this query would be an index on `UpdateTime` alone: CREATE INDEX IX__BigTable_UpdateTime ON dbo.BigTable (UpdateTime); This index will allow the execution engine to find the highest `UpdateTime` with a singleton seek to the end of the index b-tree: ![New index plan] This plan consumes just a few logical IOs (to navigate the b-tree levels) and completes immediately. Note that the Index Scan in the plan is not a full scan of the new index - it merely returns one row from the one 'end' of the index. If you do not want to create a complete new index on the table, you might consider an indexed view containing just the unique `UpdateTime` values: CREATE VIEW dbo.BigTableUpdateTimes WITH SCHEMABINDING AS SELECT UpdateTime, NumRows = COUNT_BIG(*) FROM dbo.BigTable AS BT GROUP BY UpdateTime; GO CREATE UNIQUE CLUSTERED INDEX cuq ON dbo.BigTableUpdateTimes (UpdateTime); This has the advantage of only creating a structure with as many rows as there are unique `UpdateTime` values, though every query that changes data in the base table will have extra operators added to its execution plan to maintain the indexed view. The query to find the maximum `UpdateTime` value would be: SELECT MAX(BTUT.UpdateTime) FROM dbo.BigTableUpdateTimes AS BTUT WITH (NOEXPAND); ![Indexed view plan] : https://i.stack.imgur.com/lwfpN.png : https://i.stack.imgur.com/G6JSA.png : https://i.stack.imgur.com/2IacO.png