Is there appreciable overhead from having a large proportion of empty table partitions in SQL Server? - Databases

TopAnswers Databases
Top Answer
Martin Smith (imported from SE)
This situation *can* cause noticeable performance issues.

**Setup**

    CREATE PARTITION FUNCTION [pf](datetime2(2)) AS RANGE RIGHT FOR VALUES (N'2020-05-06', N'2020-05-07', N'2020-05-08', N'2020-05-09', N'2020-05-10', N'2020-05-11', N'2020-05-12', N'2020-05-13', N'2020-05-14', N'2020-05-15', N'2020-05-16', N'2020-05-17', N'2020-05-18', N'2020-05-19', N'2020-05-20', N'2020-05-21', N'2020-05-22', N'2020-05-23', N'2020-05-24', N'2020-05-25', N'2020-05-26', N'2020-05-27', N'2020-05-28', N'2020-05-29', N'2020-05-30', N'2020-05-31', N'2020-06-01', N'2020-06-02', N'2020-06-03', N'2020-06-04', N'2020-06-05', N'2020-06-06', N'2020-06-07', N'2020-06-08', N'2020-06-09', N'2020-06-10', N'2020-06-11', N'2020-06-12', N'2020-06-13', N'2020-06-14', N'2020-06-15', N'2020-06-16', N'2020-06-17', N'2020-06-18', N'2020-06-19', N'2020-06-20', N'2020-06-21', N'2020-06-22', N'2020-06-23', N'2020-06-24', N'2020-06-25', N'2020-06-26', N'2020-06-27', N'2020-06-28', N'2020-06-29', N'2020-06-30', N'2020-07-01', N'2020-07-02', N'2020-07-03', N'2020-07-04', N'2020-07-05',N'2020-07-06', N'2020-07-07', N'2020-07-08', N'2020-07-09', N'2020-07-10', N'2020-07-11', N'2020-07-12', N'2020-07-13', N'2020-07-14', N'2020-07-15', N'2020-07-16', N'2020-07-17', N'2020-07-18', N'2020-07-19', N'2020-07-20', N'2020-07-21', N'2020-07-22', N'2020-07-23', N'2020-07-24', N'2020-07-25', N'2020-07-26', N'2020-07-27', N'2020-07-28', N'2020-07-29', N'2020-07-30', N'2020-07-31', N'2020-08-01', N'2020-08-02', N'2020-08-03')
    
    CREATE PARTITION SCHEME [ps] AS PARTITION [pf] ALL TO ([PRIMARY])
    
    CREATE TABLE T1(X INT PRIMARY KEY);
    
    INSERT INTO T1
    SELECT TOP 30000 ROW_NUMBER() OVER (ORDER BY @@SPID)
    FROM   sys.all_objects o1,
           sys.all_objects o2
    
    CREATE TABLE T2
    (
       X        INT,
       dt2      DATETIME2(2),
       OtherCol CHAR(100),
       PRIMARY KEY(X, dt2) ON ps(dt2)
    );
    
    INSERT INTO T2 (X, dt2)
    SELECT TOP 21474836 ROW_NUMBER() OVER (ORDER BY @@SPID),
                        DATEADD(MILLISECOND, 100 * ROW_NUMBER() OVER (ORDER BY @@SPID), '2020-05-05')
    FROM   sys.all_objects o1,
           sys.all_objects o2,
           sys.all_objects o3 

This sets up a situation somewhat similar to described in the question. The first 25 partitions have data and the remaining 66 are empty.

**Query**

    SET STATISTICS TIME ON;
    
    SELECT COUNT(*)
    FROM T1 INNER JOIN T2 ON T1.X = T2.X

[![enter image description here][1]][1]

The above took 9.8 seconds for me. It needed to scan the whole of `T2`.

We have a partition aligned index with leading column `X` so what happens if we force a loop join?

    SELECT COUNT(*)
    FROM T1 INNER LOOP JOIN T2 ON T1.X = T2.X

[![enter image description here][2]][2]

The loop join was actually slightly worse (0.4 seconds slower). The 30,000 seeks are unable to do any partition elimination and all need to inspect 91 partitions, multiplying out the work required significantly.

One final attempt...

    SELECT COUNT(*)
    FROM T1
    CROSS APPLY (SELECT TOP 1 $partition.pf(dt2) FROM T2 ORDER BY $partition.pf(dt2) DESC) CA(MaxPtn)
    INNER LOOP JOIN T2 ON T1.X = T2.X AND $partition.pf(dt2) <= CA.MaxPtn

[![enter image description here][3]][3]

This completed in 3.7 seconds for me. The difference is that the query now first identifies the top non empty partition and in the subsequent seeks uses this value to avoid needing to do any work for the empty ones. 

So my conclusion is that the empty partitions certainly can have a noticeable impact on query performance, and likely should be addressed if performing queries that do not include the partition column as a predicate. 

Removing all but one empty tail partition with the below...

    DECLARE @dt datetime2(2) = '2020-08-03'
    
    WHILE @dt >=  '2020-05-31'
    BEGIN
    ALTER PARTITION FUNCTION pf()  
     MERGE RANGE (@dt)  
     SET @dt = DATEADD(DAY, -1, @dt)
    END

... gives the same speedup (and no longer needed to use the hint to get a loop join plan as the estimated operator cost for the seek fell from `523.328` in the original `INNER LOOP JOIN` case to `149.546`)

[![enter image description here][4]][4]


  [1]: https://i.stack.imgur.com/30CrF.png
  [2]: https://i.stack.imgur.com/1xJAO.png
  [3]: https://i.stack.imgur.com/qY1pk.png
  [4]: https://i.stack.imgur.com/V0KQj.png
1 Answer