A K (imported from SE)
Can a T-SQL solution for gaps and islands run faster than a C# solution running on the client?

To be specific, let us provide some test data:

CREATE TABLE dbo.Numbers
(
    n INT NOT NULL PRIMARY KEY
) ;
GO

INSERT  INTO dbo.Numbers ( n )
VALUES  ( 1 ) ;
GO

DECLARE @i INT ;
SET @i = 0 ;
WHILE @i < 21
BEGIN
    INSERT  INTO dbo.Numbers ( n )
    SELECT  n + POWER(2, @i)
    FROM    dbo.Numbers ;
    SET @i = @i + 1 ;
END ;
GO

CREATE TABLE dbo.Tasks
(
    StartedAt SMALLDATETIME NOT NULL ,
    FinishedAt SMALLDATETIME NOT NULL ,
    CONSTRAINT PK_Tasks PRIMARY KEY ( StartedAt, FinishedAt ) ,
    CONSTRAINT UNQ_Tasks UNIQUE ( FinishedAt, StartedAt )
) ;
GO

INSERT  INTO dbo.Tasks
        ( StartedAt ,
          FinishedAt
        )
SELECT  DATEADD(MINUTE, n, '20100101') AS StartedAt ,
        DATEADD(MINUTE, n + 2, '20100101') AS FinishedAt
FROM    dbo.Numbers
WHERE   ( n < 500000
          OR n > 500005
        ) ;
GO

This first set of test data has exactly one gap:

SELECT  StartedAt ,
        FinishedAt
FROM    dbo.Tasks
WHERE   StartedAt BETWEEN DATEADD(MINUTE, 499999, '20100101')
                  AND     DATEADD(MINUTE, 500010, '20100101') ; -- upper bound assumed; it just needs to bracket the excluded n range

The second set of test data has 2M-1 gaps, one between every two adjacent intervals:

TRUNCATE TABLE dbo.Tasks ; -- assumed: clear out the first data set
GO

INSERT  INTO dbo.Tasks
        ( StartedAt ,
          FinishedAt
        )
SELECT  DATEADD(MINUTE, 3*n, '20100101') AS StartedAt ,
        DATEADD(MINUTE, 3*n + 2, '20100101') AS FinishedAt
FROM    dbo.Numbers
WHERE   ( n < 500000
          OR n > 500005
        ) ;
GO
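To sanity-check the two data sets at a small scale, here is a Python sketch (illustrative only; the real sets have ~2M rows) that merges the intervals into islands and counts the gaps between them:

```python
def islands(intervals):
    """Merge sorted, possibly touching/overlapping intervals into islands."""
    merged = []
    for s, f in sorted(intervals):
        if merged and s <= merged[-1][1]:  # touches or overlaps the last island
            merged[-1][1] = max(merged[-1][1], f)
        else:
            merged.append([s, f])
    return merged

# Stand-in for n = 1..2^21 minus the 500000..500005 hole.
ns = [n for n in range(1, 101) if n < 50 or n > 55]

dense = [(n, n + 2) for n in ns]           # first data set: intervals overlap
sparse = [(3 * n, 3 * n + 2) for n in ns]  # second data set: all disjoint

print(len(islands(dense)) - 1)   # gaps in the first set  -> 1
print(len(islands(sparse)) - 1)  # gaps in the second set -> 93 (len(ns) - 1)
```

The only thing the hole in the n values changes for the first set is that one island splits into two; for the second set every interval is already its own island.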

Currently I am running SQL Server 2008 R2, but solutions for SQL Server 2012 are also very welcome.
I have posted my C# solution as an answer.

A K (imported from SE)
The following C# code solves the problem:

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;
using System.IO;

// ...

var connString =
    "Initial Catalog=MyDb;Data Source=MyServer;Integrated Security=SSPI;Application Name=Benchmarks;";

var stopWatch = new Stopwatch();
stopWatch.Start();

using (var conn = new SqlConnection(connString))
{
    conn.Open();
    var command = conn.CreateCommand();
    command.CommandType = CommandType.StoredProcedure;
    command.CommandText = "dbo.GetEvents"; // procedure name assumed; lost in the import
    var gaps = new List<string>();
    using (var dr = command.ExecuteReader())
    {
        var currentEvents = 0;
        var gapStart = new DateTime();
        var gapStarted = false;
        while (dr.Read())
        {
            var change = dr.GetInt32(1);
            if (change == -1 && currentEvents == 1)
            {
                // the last open interval just closed: a gap may start here
                gapStart = dr.GetDateTime(0);
                gapStarted = true;
            }
            else if (change == 1 && currentEvents == 0 && gapStarted)
            {
                // a new interval starts after everything closed: the gap ends here
                gaps.Add(string.Format("{0},{1}", gapStart, dr.GetDateTime(0))); // output format assumed
                gapStarted = false;
            }
            currentEvents += change;
        }
    }
    File.WriteAllLines(@"C:\Temp\Gaps.txt", gaps);
}

stopWatch.Stop();
System.Console.WriteLine("Elapsed: " + stopWatch.Elapsed);

This code invokes this stored procedure:

CREATE PROCEDURE dbo.GetEvents -- procedure name assumed; lost in the import
AS
BEGIN ;
    SELECT  EventTime ,
            Change
    FROM    ( SELECT  StartedAt AS EventTime ,
                      1 AS Change
              FROM    dbo.Tasks
              UNION ALL
              SELECT  FinishedAt AS EventTime ,
                      -1 AS Change
              FROM    dbo.Tasks
            ) AS e
    ORDER BY EventTime, Change DESC ;
END ;
GO

It finds and prints one gap in 2M intervals in the following times, warm cache:

1 gap: Elapsed: 00:00:01.4852029 00:00:01.4444307 00:00:01.4644152

It finds and prints 2M-1 gaps in 2M intervals in the following times, warm cache:

2M-1 gaps Elapsed: 00:00:08.8576637 00:00:08.9123053 00:00:09.0372344 00:00:08.8545477

This is a very simple solution; it took me ten minutes to develop, and a recent college graduate could come up with it. On the database side, the execution plan is a trivial merge join that uses very little CPU and memory.
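For comparison, the same event sweep the C# loop performs can be sketched in a few lines of Python (an illustration of the algorithm, not the benchmarked code; the sort key mirrors the procedure's ORDER BY EventTime, Change DESC):

```python
from datetime import datetime, timedelta

def find_gaps(tasks):
    """Mirror of the C# loop: sort +1/-1 events, track the open-interval count."""
    events = sorted(
        [(s, 1) for s, f in tasks] + [(f, -1) for s, f in tasks],
        key=lambda e: (e[0], -e[1]),  # starts before finishes at equal times
    )
    gaps, current_events, gap_start, gap_started = [], 0, None, False
    for event_time, change in events:
        if change == -1 and current_events == 1:
            gap_start, gap_started = event_time, True   # everything closed
        elif change == 1 and current_events == 0 and gap_started:
            gaps.append((gap_start, event_time))        # something reopened
            gap_started = False
        current_events += change
    return gaps

base = datetime(2010, 1, 1)
tasks = [(base + timedelta(minutes=n), base + timedelta(minutes=n + 2))
         for n in range(1, 101) if n < 50 or n > 55]
print(find_gaps(tasks))  # one gap: minute 51 to minute 56
```

Sorting starts ahead of finishes at equal times is what keeps back-to-back intervals from registering as gaps.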

**Edit:** to be realistic, I am running client and server on separate boxes.
And a 1-second solution...

;WITH cteSource(StartedAt, FinishedAt)
AS (
    SELECT      s.StartedAt,
                e.FinishedAt
    FROM        (
                    SELECT  StartedAt,
                            ROW_NUMBER() OVER (ORDER BY StartedAt) AS rn
                    FROM    dbo.Tasks
                ) AS s
    INNER JOIN  (
                    SELECT  FinishedAt,
                            ROW_NUMBER() OVER (ORDER BY FinishedAt) + 1 AS rn
                    FROM    dbo.Tasks
                ) AS e ON e.rn = s.rn
    WHERE       s.StartedAt > e.FinishedAt

    UNION ALL

    SELECT      MIN(StartedAt),
                MAX(FinishedAt)
    FROM        dbo.Tasks
), cteGrouped(theTime, grp)
AS (
    SELECT      u.theTime,
                (ROW_NUMBER() OVER (ORDER BY u.theTime) - 1) / 2
    FROM        cteSource AS s
    UNPIVOT     (
                    theTime
                    FOR theColumn IN (StartedAt, FinishedAt)
                ) AS u
)
SELECT      MIN(theTime),
            MAX(theTime)
FROM        cteGrouped
GROUP BY    grp
ORDER BY    grp ;
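The shape of that query is easier to see in miniature. The following Python sketch (mine, purely illustrative) walks the same three steps: join the finish sequence, offset by one row, against the start sequence to get the gap boundaries, append the overall MIN(StartedAt)/MAX(FinishedAt) row, then flatten, sort, and pair consecutive values to recover the islands:

```python
def islands_via_offset(tasks):
    """Mimic the cteSource/UNPIVOT query: gap boundary pairs plus the overall
    min/max row, flattened, sorted, and regrouped two-by-two into islands."""
    starts = sorted(s for s, f in tasks)
    finishes = sorted(f for s, f in tasks)
    # e.rn = s.rn with e's ROW_NUMBER offset by +1, keeping StartedAt > FinishedAt
    rows = [(s, f) for s, f in zip(starts[1:], finishes) if s > f]
    rows.append((min(starts), max(finishes)))  # the UNION ALL row
    # UNPIVOT: flatten every boundary time, then pair consecutive values
    times = sorted(t for row in rows for t in row)
    return [(times[i], times[i + 1]) for i in range(0, len(times), 2)]

tasks = [(3 * n, 3 * n + 2) for n in range(1, 6)]  # disjoint: (3,5)..(15,17)
print(islands_via_offset(tasks))  # [(3, 5), (6, 8), (9, 11), (12, 14), (15, 17)]
```

Each gap contributes one "end of previous island / start of next island" pair, so once the global endpoints are added, every consecutive pair of sorted boundary times is an island.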


Here is a solution which runs in 4 seconds.

WITH cteRaw(ts, type, e, s)
AS (
    SELECT      StartedAt,
                1 AS type,
                NULL,
                ROW_NUMBER() OVER (ORDER BY StartedAt)
    FROM        dbo.Tasks

    UNION ALL

    SELECT      FinishedAt,
                -1 AS type,
                ROW_NUMBER() OVER (ORDER BY FinishedAt),
                NULL
    FROM        dbo.Tasks
), cteCombined(ts, e, s, se)
AS (
    SELECT      ts,
                e,
                s,
                ROW_NUMBER() OVER (ORDER BY ts, type DESC)
    FROM        cteRaw
), cteFiltered(ts, grpnum)
AS (
    SELECT      ts,
                (ROW_NUMBER() OVER (ORDER BY ts) - 1) / 2 AS grpnum
    FROM        cteCombined
    WHERE       COALESCE(s + s - se - 1, se - e - e) = 0
)
SELECT      MIN(ts) AS starttime,
            MAX(ts) AS endtime
FROM        cteFiltered
GROUP BY    grpnum ;
puzsol (imported from SE)
I think I have exhausted the limits of my knowledge of SQL Server on this one...

For finding gaps in SQL Server (what the C# code does), assuming you don't care about leading or trailing gaps (those before the first start or after the last finish), the following query (or variants of it) is the fastest I could find:

SELECT  e.FinishedAt AS GapStart, s.StartedAt AS GapEnd
FROM    (
            SELECT  StartedAt, ROW_NUMBER() OVER (ORDER BY StartedAt) AS rn
            FROM    dbo.Tasks
        ) AS s
INNER JOIN
        (
            SELECT  FinishedAt, ROW_NUMBER() OVER (ORDER BY FinishedAt) + 1 AS rn
            FROM    dbo.Tasks
        ) AS e ON e.rn = s.rn AND s.StartedAt > e.FinishedAt

This works through sleight of hand: for each set of start-finish intervals, you can treat the starts and finishes as two separate ordered sequences, offset the finishes by one row, and the gaps fall out of the comparison.

E.g. take (S1, F1), (S2, F2), (S3, F3), and order the two sequences as {S1, S2, S3, null} and {null, F1, F2, F3}.
Then compare row n to row n in each set; the gaps are wherever the F value is less than the S value. The problem, I think, is that in SQL Server there is no way to join or compare two separate sets purely on the order of the values in each set - hence the use of the ROW_NUMBER function to allow a merge based purely on row number. But there is no way to tell SQL Server that those values are unique (short of inserting them into a table variable with an index on it, which takes longer - I tried it), so I think the merge join is less than optimal (though that's hard to prove when it's faster than anything else I could do).
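That row-for-row comparison of two offset sequences is a one-liner in any language with ordered sequences. A small Python sketch of the same idea (purely illustrative; the values are made up):

```python
starts = [1, 2, 8, 12]     # S1..S4, sorted
finishes = [3, 4, 10, 14]  # F1..F4, sorted

# Align row n of {S2, S3, S4} with row n of {F1, F2, F3}: a gap exists
# wherever the finish is less than the paired start.
gaps = [(f, s) for s, f in zip(starts[1:], finishes) if f < s]
print(gaps)  # [(4, 8), (10, 12)]
```

The `zip` is exactly the positional join SQL Server has no native construct for, which is why the query above has to manufacture row numbers first.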

I was able to get solutions using the LAG/LEAD functions:

SELECT  *
FROM    (
            SELECT  TOP (100) PERCENT
                    StartedAt, FinishedAt,
                    LEAD(StartedAt, 1, NULL) OVER (ORDER BY FinishedAt) AS NextStart
            FROM    dbo.Tasks
        ) AS x
WHERE   NextStart > FinishedAt

(Which, by the way, I don't guarantee the results of - it seems to work, but I think it relies on StartedAt being in order in the Tasks table... and it was slower.)

Using a running sum of the changes:

SELECT  *
FROM    (
            SELECT  EventTime, Change,
                    SUM(Change) OVER (ORDER BY EventTime, Change DESC
                                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunTotal
            FROM    (
                        SELECT  StartedAt AS EventTime, 1 AS Change FROM dbo.Tasks
                        UNION ALL
                        SELECT  FinishedAt AS EventTime, -1 AS Change FROM dbo.Tasks
                    ) AS e
        ) AS x
WHERE   x.RunTotal = 0 OR (x.RunTotal = 1 AND x.Change = 1)
ORDER BY EventTime, Change DESC
(no surprise, also slower)
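The running-total filter is easier to see on a toy input. This Python sketch (illustrative) reproduces the SUM(Change) OVER logic and keeps exactly the rows the WHERE clause keeps - every island boundary, which can then be paired off:

```python
from itertools import accumulate

tasks = [(1, 3), (2, 4), (8, 10), (9, 12)]
events = sorted(
    [(s, 1) for s, f in tasks] + [(f, -1) for s, f in tasks],
    key=lambda e: (e[0], -e[1]),  # ORDER BY EventTime, Change DESC
)
run_totals = list(accumulate(change for _, change in events))

# Keep rows where the total hits 0 (an island ends) or a start event
# brings it to 1 (an island begins) - the WHERE clause above.
boundary = [
    (t, change) for (t, change), rt in zip(events, run_totals)
    if rt == 0 or (rt == 1 and change == 1)
]
print(boundary)  # [(1, 1), (4, -1), (8, 1), (12, -1)]

# Consecutive boundary pairs are the islands; the gaps sit between them.
islands = [(boundary[i][0], boundary[i + 1][0])
           for i in range(0, len(boundary), 2)]
print(islands)  # [(1, 4), (8, 12)]
```

Note the filter keeps the very first start and the very last finish too, which is what makes the two-by-two pairing line up.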

I even tried a CLR aggregate function (to replace the SUM - it was slower than SUM and relied on ROW_NUMBER() to keep the order of the data) and a CLR table-valued function (to open two result sets and compare values based purely on sequence)... and it too was slower. I banged my head so many times against SQL and CLR limitations trying many other methods...

And for what?

Running on the same machine, and spitting both the C# data and the SQL-filtered data into a file (as per the original C# code), the times are virtually the same: approximately 2 seconds for the 1-gap data (C# usually faster), and 8-10 seconds for the multi-gap data set (SQL usually faster).

**NOTE**: Do not use the SQL Server development environment for timing comparisons, as its display-to-grid rendering takes time. Tested with SQL Server 2012, VS2010, and the .NET 4.0 Client Profile.

I will point out that both solutions perform pretty much the same sorting of data on the SQL Server, so the server load for the fetch-and-sort will be similar whichever solution you use; the only differences are where the processing happens (client rather than server) and the transfer over the network.

I don't know what the difference might be when partitioning, by staff member perhaps, or when you need extra data along with the gap information (though I can't think of much besides a staff id), or of course if there is a **slow** data connection between the SQL Server and the client machine (or a **slow** client). Nor have I compared lock times, contention issues, or CPU/network load under multiple users, so I don't know which approach is more likely to become the bottleneck in those cases.

What I do know is: yes, SQL Server is not good at this sort of set comparison, and if you don't write the query right you will pay for it dearly.

Is it easier or harder than writing the C# version? I'm not entirely sure. The +/-1 Change, running-total solution is not entirely intuitive either, and I bet it's not the first solution an average graduate would come up with; once done it is easy enough to copy, but it takes insight to write in the first place. The same can be said of the SQL version.
Which is harder? Which is more robust to rogue data? Which has more potential for parallel operation? Does it really matter when the difference is so small compared to the programming effort?

One last note: there is an unstated constraint on the data - StartedAt **must** be less than FinishedAt, or you will get bad results.