A K (imported from SE)
Can a T-SQL solution for gaps and islands run faster than a C# solution running on the client?

To be specific, let us provide some test data:

CREATE TABLE dbo.Numbers
(
    n INT NOT NULL PRIMARY KEY
) ;
GO

INSERT  INTO dbo.Numbers ( n )
VALUES  ( 1 ) ;
GO

DECLARE @i INT ;
SET @i = 0 ;
WHILE @i < 21
BEGIN
    INSERT  INTO dbo.Numbers ( n )
    SELECT  n + POWER(2, @i)
    FROM    dbo.Numbers ;
    SET @i = @i + 1 ;
END ;
GO

CREATE TABLE dbo.Tasks
(
    StartedAt SMALLDATETIME NOT NULL ,
    FinishedAt SMALLDATETIME NOT NULL ,
    CONSTRAINT PK_Tasks PRIMARY KEY ( StartedAt, FinishedAt ) ,
    CONSTRAINT UNQ_Tasks UNIQUE ( FinishedAt, StartedAt )
) ;
GO

INSERT  INTO dbo.Tasks
        ( StartedAt ,
          FinishedAt
        )
SELECT  DATEADD(MINUTE, n, '20100101') AS StartedAt ,
        DATEADD(MINUTE, n + 2, '20100101') AS FinishedAt
FROM    dbo.Numbers
WHERE   ( n < 500000
          OR n > 500005
        ) ;
GO

This first set of test data has exactly one gap:

SELECT  StartedAt ,
        FinishedAt
FROM    dbo.Tasks
WHERE   StartedAt BETWEEN DATEADD(MINUTE, 499999, '20100101')
                  AND     DATEADD(MINUTE, 500010, '20100101') ; -- upper bound assumed; it just needs to bracket the excluded n range

The second set of test data has 2M-1 gaps, one between every two adjacent intervals:

TRUNCATE TABLE dbo.Tasks ; -- assumed: clear out the first data set
GO

INSERT  INTO dbo.Tasks
        ( StartedAt ,
          FinishedAt
        )
SELECT  DATEADD(MINUTE, 3*n, '20100101') AS StartedAt ,
        DATEADD(MINUTE, 3*n + 2, '20100101') AS FinishedAt
FROM    dbo.Numbers
WHERE   ( n < 500000
          OR n > 500005
        ) ;
GO
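To sanity-check the two data sets at a small scale, here is a Python sketch (illustrative only; the real sets have ~2M rows) that merges the intervals into islands and counts the gaps between them:

```python
def islands(intervals):
    """Merge sorted, possibly touching/overlapping intervals into islands."""
    merged = []
    for s, f in sorted(intervals):
        if merged and s <= merged[-1][1]:  # touches or overlaps the last island
            merged[-1][1] = max(merged[-1][1], f)
        else:
            merged.append([s, f])
    return merged

# Stand-in for n = 1..2^21 minus the 500000..500005 hole.
ns = [n for n in range(1, 101) if n < 50 or n > 55]

dense = [(n, n + 2) for n in ns]           # first data set: intervals overlap
sparse = [(3 * n, 3 * n + 2) for n in ns]  # second data set: all disjoint

print(len(islands(dense)) - 1)   # gaps in the first set  -> 1
print(len(islands(sparse)) - 1)  # gaps in the second set -> 93 (len(ns) - 1)
```

The only thing the hole in the n values changes for the first set is that one island splits into two; for the second set every interval is already its own island.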

Currently I am running SQL Server 2008 R2, but solutions for SQL Server 2012 are also very welcome.
I have posted my C# solution as an answer.

A K (imported from SE)
The following C# code solves the problem:

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;
using System.IO;

// ...

var connString =
    "Initial Catalog=MyDb;Data Source=MyServer;Integrated Security=SSPI;Application Name=Benchmarks;";

var stopWatch = new Stopwatch();
stopWatch.Start();

using (var conn = new SqlConnection(connString))
{
    conn.Open();
    var command = conn.CreateCommand();
    command.CommandType = CommandType.StoredProcedure;
    command.CommandText = "dbo.GetEvents"; // procedure name assumed; lost in the import
    var gaps = new List<string>();
    using (var dr = command.ExecuteReader())
    {
        var currentEvents = 0;
        var gapStart = new DateTime();
        var gapStarted = false;
        while (dr.Read())
        {
            var change = dr.GetInt32(1);
            if (change == -1 && currentEvents == 1)
            {
                // the last open interval just closed: a gap may start here
                gapStart = dr.GetDateTime(0);
                gapStarted = true;
            }
            else if (change == 1 && currentEvents == 0 && gapStarted)
            {
                // a new interval starts after everything closed: the gap ends here
                gaps.Add(string.Format("{0},{1}", gapStart, dr.GetDateTime(0))); // output format assumed
                gapStarted = false;
            }
            currentEvents += change;
        }
    }
    File.WriteAllLines(@"C:\Temp\Gaps.txt", gaps);
}

stopWatch.Stop();
System.Console.WriteLine("Elapsed: " + stopWatch.Elapsed);

This code invokes this stored procedure:

CREATE PROCEDURE dbo.GetEvents -- procedure name assumed; lost in the import
AS
BEGIN ;
    SELECT  EventTime ,
            Change
    FROM    ( SELECT  StartedAt AS EventTime ,
                      1 AS Change
              FROM    dbo.Tasks
              UNION ALL
              SELECT  FinishedAt AS EventTime ,
                      -1 AS Change
              FROM    dbo.Tasks
            ) AS e
    ORDER BY EventTime, Change DESC ;
END ;
GO

It finds and prints one gap in 2M intervals in the following times, warm cache:

1 gap: Elapsed: 00:00:01.4852029 00:00:01.4444307 00:00:01.4644152

It finds and prints 2M-1 gaps in 2M intervals in the following times, warm cache:

2M-1 gaps Elapsed: 00:00:08.8576637 00:00:08.9123053 00:00:09.0372344 00:00:08.8545477

This is a very simple solution; it took me ten minutes to develop, and a recent college graduate could come up with it. On the database side, the execution plan is a trivial merge join that uses very little CPU and memory.
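For comparison, the same event sweep the C# loop performs can be sketched in a few lines of Python (an illustration of the algorithm, not the benchmarked code; the sort key mirrors the procedure's ORDER BY EventTime, Change DESC):

```python
from datetime import datetime, timedelta

def find_gaps(tasks):
    """Mirror of the C# loop: sort +1/-1 events, track the open-interval count."""
    events = sorted(
        [(s, 1) for s, f in tasks] + [(f, -1) for s, f in tasks],
        key=lambda e: (e[0], -e[1]),  # starts before finishes at equal times
    )
    gaps, current_events, gap_start, gap_started = [], 0, None, False
    for event_time, change in events:
        if change == -1 and current_events == 1:
            gap_start, gap_started = event_time, True   # everything closed
        elif change == 1 and current_events == 0 and gap_started:
            gaps.append((gap_start, event_time))        # something reopened
            gap_started = False
        current_events += change
    return gaps

base = datetime(2010, 1, 1)
tasks = [(base + timedelta(minutes=n), base + timedelta(minutes=n + 2))
         for n in range(1, 101) if n < 50 or n > 55]
print(find_gaps(tasks))  # one gap: minute 51 to minute 56
```

Sorting starts ahead of finishes at equal times is what keeps back-to-back intervals from registering as gaps.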

**Edit:** to be realistic, I am running client and server on separate boxes.
And a 1-second solution...

;WITH cteSource(StartedAt, FinishedAt)
AS (
    SELECT      s.StartedAt,
                e.FinishedAt
    FROM        (
                    SELECT  StartedAt,
                            ROW_NUMBER() OVER (ORDER BY StartedAt) AS rn
                    FROM    dbo.Tasks
                ) AS s
    INNER JOIN  (
                    SELECT  FinishedAt,
                            ROW_NUMBER() OVER (ORDER BY FinishedAt) + 1 AS rn
                    FROM    dbo.Tasks
                ) AS e ON e.rn = s.rn
    WHERE       s.StartedAt > e.FinishedAt

    UNION ALL

    SELECT      MIN(StartedAt),
                MAX(FinishedAt)
    FROM        dbo.Tasks
), cteGrouped(theTime, grp)
AS (
    SELECT      u.theTime,
                (ROW_NUMBER() OVER (ORDER BY u.theTime) - 1) / 2
    FROM        cteSource AS s
    UNPIVOT     (
                    theTime
                    FOR theColumn IN (StartedAt, FinishedAt)
                ) AS u
)
SELECT      MIN(theTime),
            MAX(theTime)
FROM        cteGrouped
GROUP BY    grp
ORDER BY    grp ;
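The shape of that query is easier to see in miniature. The following Python sketch (mine, purely illustrative) walks the same three steps: join the finish sequence, offset by one row, against the start sequence to get the gap boundaries, append the overall MIN(StartedAt)/MAX(FinishedAt) row, then flatten, sort, and pair consecutive values to recover the islands:

```python
def islands_via_offset(tasks):
    """Mimic the cteSource/UNPIVOT query: gap boundary pairs plus the overall
    min/max row, flattened, sorted, and regrouped two-by-two into islands."""
    starts = sorted(s for s, f in tasks)
    finishes = sorted(f for s, f in tasks)
    # e.rn = s.rn with e's ROW_NUMBER offset by +1, keeping StartedAt > FinishedAt
    rows = [(s, f) for s, f in zip(starts[1:], finishes) if s > f]
    rows.append((min(starts), max(finishes)))  # the UNION ALL row
    # UNPIVOT: flatten every boundary time, then pair consecutive values
    times = sorted(t for row in rows for t in row)
    return [(times[i], times[i + 1]) for i in range(0, len(times), 2)]

tasks = [(3 * n, 3 * n + 2) for n in range(1, 6)]  # disjoint: (3,5)..(15,17)
print(islands_via_offset(tasks))  # [(3, 5), (6, 8), (9, 11), (12, 14), (15, 17)]
```

Each gap contributes one "end of previous island / start of next island" pair, so once the global endpoints are added, every consecutive pair of sorted boundary times is an island.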


Here is a solution which runs in 4 seconds.

WITH cteRaw(ts, type, e, s)
AS (
    SELECT      StartedAt,
                1 AS type,
                NULL,
                ROW_NUMBER() OVER (ORDER BY StartedAt)
    FROM        dbo.Tasks

    UNION ALL

    SELECT      FinishedAt,
                -1 AS type,
                ROW_NUMBER() OVER (ORDER BY FinishedAt),
                NULL
    FROM        dbo.Tasks
), cteCombined(ts, e, s, se)
AS (
    SELECT      ts,
                e,
                s,
                ROW_NUMBER() OVER (ORDER BY ts, type DESC)
    FROM        cteRaw
), cteFiltered(ts, grpnum)
AS (
    SELECT      ts,
                (ROW_NUMBER() OVER (ORDER BY ts) - 1) / 2 AS grpnum
    FROM        cteCombined
    WHERE       COALESCE(s + s - se - 1, se - e - e) = 0
)
SELECT      MIN(ts) AS starttime,
            MAX(ts) AS endtime
FROM        cteFiltered
GROUP BY    grpnum ;
puzsol (imported from SE)
I think I have exhausted the limits of my knowledge of SQL Server on this one...

For finding gaps in SQL Server (what the C# code does), assuming you don't care about leading or trailing gaps (those before the first start or after the last finish), the following query (or variants of it) is the fastest I could find:

SELECT  e.FinishedAt AS GapStart, s.StartedAt AS GapEnd
FROM    (
            SELECT  StartedAt, ROW_NUMBER() OVER (ORDER BY StartedAt) AS rn
            FROM    dbo.Tasks
        ) AS s
INNER JOIN
        (
            SELECT  FinishedAt, ROW_NUMBER() OVER (ORDER BY FinishedAt) + 1 AS rn
            FROM    dbo.Tasks
        ) AS e ON e.rn = s.rn AND s.StartedAt > e.FinishedAt

This works through sleight of hand: for each set of start-finish intervals, you can treat the starts and finishes as two separate ordered sequences, offset the finishes by one row, and the gaps fall out of the comparison.

E.g. take (S1, F1), (S2, F2), (S3, F3), and order the two sequences as {S1, S2, S3, null} and {null, F1, F2, F3}.
Then compare row n to row n in each set; the gaps are wherever the F value is less than the S value. The problem, I think, is that in SQL Server there is no way to join or compare two separate sets purely on the order of the values in each set - hence the use of the ROW_NUMBER function to allow a merge based purely on row number. But there is no way to tell SQL Server that those values are unique (short of inserting them into a table variable with an index on it, which takes longer - I tried it), so I think the merge join is less than optimal (though that's hard to prove when it's faster than anything else I could do).
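That row-for-row comparison of two offset sequences is a one-liner in any language with ordered sequences. A small Python sketch of the same idea (purely illustrative; the values are made up):

```python
starts = [1, 2, 8, 12]     # S1..S4, sorted
finishes = [3, 4, 10, 14]  # F1..F4, sorted

# Align row n of {S2, S3, S4} with row n of {F1, F2, F3}: a gap exists
# wherever the finish is less than the paired start.
gaps = [(f, s) for s, f in zip(starts[1:], finishes) if f < s]
print(gaps)  # [(4, 8), (10, 12)]
```

The `zip` is exactly the positional join SQL Server has no native construct for, which is why the query above has to manufacture row numbers first.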

I was able to get solutions using the LAG/LEAD functions:

SELECT  *
FROM    (
            SELECT  TOP (100) PERCENT
                    StartedAt, FinishedAt,
                    LEAD(StartedAt, 1, NULL) OVER (ORDER BY FinishedAt) AS NextStart
            FROM    dbo.Tasks
        ) AS x
WHERE   NextStart > FinishedAt

(Which, by the way, I don't guarantee the results of - it seems to work, but I think it relies on StartedAt being in order in the Tasks table... and it was slower.)

Using a running sum of the changes:

SELECT  *
FROM    (
            SELECT  EventTime, Change,
                    SUM(Change) OVER (ORDER BY EventTime, Change DESC
                                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunTotal
            FROM    (
                        SELECT  StartedAt AS EventTime, 1 AS Change FROM dbo.Tasks
                        UNION ALL
                        SELECT  FinishedAt AS EventTime, -1 AS Change FROM dbo.Tasks
                    ) AS e
        ) AS x
WHERE   x.RunTotal = 0 OR (x.RunTotal = 1 AND x.Change = 1)
ORDER BY EventTime, Change DESC
(no surprise, also slower)
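The running-total filter is easier to see on a toy input. This Python sketch (illustrative) reproduces the SUM(Change) OVER logic and keeps exactly the rows the WHERE clause keeps - every island boundary, which can then be paired off:

```python
from itertools import accumulate

tasks = [(1, 3), (2, 4), (8, 10), (9, 12)]
events = sorted(
    [(s, 1) for s, f in tasks] + [(f, -1) for s, f in tasks],
    key=lambda e: (e[0], -e[1]),  # ORDER BY EventTime, Change DESC
)
run_totals = list(accumulate(change for _, change in events))

# Keep rows where the total hits 0 (an island ends) or a start event
# brings it to 1 (an island begins) - the WHERE clause above.
boundary = [
    (t, change) for (t, change), rt in zip(events, run_totals)
    if rt == 0 or (rt == 1 and change == 1)
]
print(boundary)  # [(1, 1), (4, -1), (8, 1), (12, -1)]

# Consecutive boundary pairs are the islands; the gaps sit between them.
islands = [(boundary[i][0], boundary[i + 1][0])
           for i in range(0, len(boundary), 2)]
print(islands)  # [(1, 4), (8, 12)]
```

Note the filter keeps the very first start and the very last finish too, which is what makes the two-by-two pairing line up.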

I even tried a CLR aggregate function (to replace the SUM - it was slower than SUM and relied on ROW_NUMBER() to keep the order of the data) and a CLR table-valued function (to open two result sets and compare values based purely on sequence)... and it too was slower. I banged my head so many times against SQL and CLR limitations trying many other methods...

And for what?

Running on the same machine, and spitting both the C# data and the SQL-filtered data into a file (as per the original C# code), the times are virtually the same: approximately 2 seconds for the 1-gap data (C# usually faster), and 8-10 seconds for the multi-gap data set (SQL usually faster).

**NOTE**: Do not use the SQL Server development environment for timing comparisons, as its display-to-grid rendering takes time. Tested with SQL Server 2012, VS2010, and the .NET 4.0 Client Profile.

I will point out that both solutions perform pretty much the same sorting of data on the SQL Server, so the server load for the fetch-and-sort will be similar whichever solution you use; the only differences are where the processing happens (client rather than server) and the transfer over the network.

I don't know what the difference might be when partitioning, by staff member perhaps, or when you need extra data along with the gap information (though I can't think of much besides a staff id), or of course if there is a **slow** data connection between the SQL Server and the client machine (or a **slow** client). Nor have I compared lock times, contention issues, or CPU/network load under multiple users, so I don't know which approach is more likely to become the bottleneck in those cases.

What I do know is: yes, SQL Server is not good at this sort of set comparison, and if you don't write the query right you will pay for it dearly.

Is it easier or harder than writing the C# version? I'm not entirely sure. The +/-1 Change, running-total solution is not entirely intuitive either, and I bet it's not the first solution an average graduate would come up with; once done it is easy enough to copy, but it takes insight to write in the first place. The same can be said of the SQL version.
Which is harder? Which is more robust to rogue data? Which has more potential for parallel operation? Does it really matter when the difference is so small compared to the programming effort?

One last note: there is an unstated constraint on the data - StartedAt **must** be less than FinishedAt, or you will get bad results.