A K (imported from SE)
Can a T-SQL solution for gaps and islands run faster than a C# solution running on the client?

To be specific, let us provide some test data:

CREATE TABLE dbo.Numbers
(
    n INT NOT NULL PRIMARY KEY
) ;
GO

INSERT  INTO dbo.Numbers ( n )
VALUES  ( 1 ) ;
GO

DECLARE @i INT ;
SET @i = 0 ;
WHILE @i < 21
BEGIN
    INSERT  INTO dbo.Numbers ( n )
    SELECT  n + POWER(2, @i)
    FROM    dbo.Numbers ;
    SET @i = @i + 1 ;
END ;
GO

CREATE TABLE dbo.Tasks
(
    StartedAt SMALLDATETIME NOT NULL ,
    FinishedAt SMALLDATETIME NOT NULL ,
    CONSTRAINT PK_Tasks PRIMARY KEY ( StartedAt, FinishedAt ) ,
    CONSTRAINT UNQ_Tasks UNIQUE ( FinishedAt, StartedAt )
) ;
GO

INSERT  INTO dbo.Tasks
        ( StartedAt ,
          FinishedAt
        )
SELECT  DATEADD(MINUTE, n, '20100101') AS StartedAt ,
        DATEADD(MINUTE, n + 2, '20100101') AS FinishedAt
FROM    dbo.Numbers
WHERE   ( n < 500000
          OR n > 500005
        ) ;
GO

This first set of test data has exactly one gap:

SELECT  StartedAt ,
        FinishedAt
FROM    dbo.Tasks
WHERE   StartedAt BETWEEN DATEADD(MINUTE, 499999, '20100101')
                  AND     DATEADD(MINUTE, 500010, '20100101') ; -- upper bound assumed; it just needs to bracket the excluded n range

The second set of test data has 2M-1 gaps, one between every two adjacent intervals:

TRUNCATE TABLE dbo.Tasks ; -- assumed: clear out the first data set
GO

INSERT  INTO dbo.Tasks
        ( StartedAt ,
          FinishedAt
        )
SELECT  DATEADD(MINUTE, 3*n, '20100101') AS StartedAt ,
        DATEADD(MINUTE, 3*n + 2, '20100101') AS FinishedAt
FROM    dbo.Numbers
WHERE   ( n < 500000
          OR n > 500005
        ) ;
GO
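To sanity-check the two data sets at a small scale, here is a Python sketch (illustrative only; the real sets have ~2M rows) that merges the intervals into islands and counts the gaps between them:

```python
def islands(intervals):
    """Merge sorted, possibly touching/overlapping intervals into islands."""
    merged = []
    for s, f in sorted(intervals):
        if merged and s <= merged[-1][1]:  # touches or overlaps the last island
            merged[-1][1] = max(merged[-1][1], f)
        else:
            merged.append([s, f])
    return merged

# Stand-in for n = 1..2^21 minus the 500000..500005 hole.
ns = [n for n in range(1, 101) if n < 50 or n > 55]

dense = [(n, n + 2) for n in ns]           # first data set: intervals overlap
sparse = [(3 * n, 3 * n + 2) for n in ns]  # second data set: all disjoint

print(len(islands(dense)) - 1)   # gaps in the first set  -> 1
print(len(islands(sparse)) - 1)  # gaps in the second set -> 93 (len(ns) - 1)
```

The only thing the hole in the n values changes for the first set is that one island splits into two; for the second set every interval is already its own island.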

Currently I am running SQL Server 2008 R2, but solutions for SQL Server 2012 are also very welcome.
I have posted my C# solution as an answer.

A K (imported from SE)
The following C# code solves the problem:

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Diagnostics;
using System.IO;

// ...

var connString =
    "Initial Catalog=MyDb;Data Source=MyServer;Integrated Security=SSPI;Application Name=Benchmarks;";

var stopWatch = new Stopwatch();
stopWatch.Start();

using (var conn = new SqlConnection(connString))
{
    conn.Open();
    var command = conn.CreateCommand();
    command.CommandType = CommandType.StoredProcedure;
    command.CommandText = "dbo.GetEvents"; // procedure name assumed; lost in the import
    var gaps = new List<string>();
    using (var dr = command.ExecuteReader())
    {
        var currentEvents = 0;
        var gapStart = new DateTime();
        var gapStarted = false;
        while (dr.Read())
        {
            var change = dr.GetInt32(1);
            if (change == -1 && currentEvents == 1)
            {
                // the last open interval just closed: a gap may start here
                gapStart = dr.GetDateTime(0);
                gapStarted = true;
            }
            else if (change == 1 && currentEvents == 0 && gapStarted)
            {
                // a new interval starts after everything closed: the gap ends here
                gaps.Add(string.Format("{0},{1}", gapStart, dr.GetDateTime(0))); // output format assumed
                gapStarted = false;
            }
            currentEvents += change;
        }
    }
    File.WriteAllLines(@"C:\Temp\Gaps.txt", gaps);
}

stopWatch.Stop();
System.Console.WriteLine("Elapsed: " + stopWatch.Elapsed);

This code invokes this stored procedure:

CREATE PROCEDURE dbo.GetEvents -- procedure name assumed; lost in the import
AS
BEGIN ;
    SELECT  EventTime ,
            Change
    FROM    ( SELECT  StartedAt AS EventTime ,
                      1 AS Change
              FROM    dbo.Tasks
              UNION ALL
              SELECT  FinishedAt AS EventTime ,
                      -1 AS Change
              FROM    dbo.Tasks
            ) AS e
    ORDER BY EventTime, Change DESC ;
END ;
GO

It finds and prints one gap in 2M intervals in the following times, warm cache:

1 gap: Elapsed: 00:00:01.4852029 00:00:01.4444307 00:00:01.4644152

It finds and prints 2M-1 gaps in 2M intervals in the following times, warm cache:

2M-1 gaps Elapsed: 00:00:08.8576637 00:00:08.9123053 00:00:09.0372344 00:00:08.8545477

This is a very simple solution; it took me ten minutes to develop, and a recent college graduate could come up with it. On the database side, the execution plan is a trivial merge join that uses very little CPU and memory.
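For comparison, the same event sweep the C# loop performs can be sketched in a few lines of Python (an illustration of the algorithm, not the benchmarked code; the sort key mirrors the procedure's ORDER BY EventTime, Change DESC):

```python
from datetime import datetime, timedelta

def find_gaps(tasks):
    """Mirror of the C# loop: sort +1/-1 events, track the open-interval count."""
    events = sorted(
        [(s, 1) for s, f in tasks] + [(f, -1) for s, f in tasks],
        key=lambda e: (e[0], -e[1]),  # starts before finishes at equal times
    )
    gaps, current_events, gap_start, gap_started = [], 0, None, False
    for event_time, change in events:
        if change == -1 and current_events == 1:
            gap_start, gap_started = event_time, True   # everything closed
        elif change == 1 and current_events == 0 and gap_started:
            gaps.append((gap_start, event_time))        # something reopened
            gap_started = False
        current_events += change
    return gaps

base = datetime(2010, 1, 1)
tasks = [(base + timedelta(minutes=n), base + timedelta(minutes=n + 2))
         for n in range(1, 101) if n < 50 or n > 55]
print(find_gaps(tasks))  # one gap: minute 51 to minute 56
```

Sorting starts ahead of finishes at equal times is what keeps back-to-back intervals from registering as gaps.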

**Edit:** to be realistic, I am running client and server on separate boxes.
And a 1-second solution...

;WITH cteSource(StartedAt, FinishedAt)
AS (
    SELECT      s.StartedAt,
                e.FinishedAt
    FROM        (
                    SELECT  StartedAt,
                            ROW_NUMBER() OVER (ORDER BY StartedAt) AS rn
                    FROM    dbo.Tasks
                ) AS s
    INNER JOIN  (
                    SELECT  FinishedAt,
                            ROW_NUMBER() OVER (ORDER BY FinishedAt) + 1 AS rn
                    FROM    dbo.Tasks
                ) AS e ON e.rn = s.rn
    WHERE       s.StartedAt > e.FinishedAt

    UNION ALL

    SELECT      MIN(StartedAt),
                MAX(FinishedAt)
    FROM        dbo.Tasks
), cteGrouped(theTime, grp)
AS (
    SELECT      u.theTime,
                (ROW_NUMBER() OVER (ORDER BY u.theTime) - 1) / 2
    FROM        cteSource AS s
    UNPIVOT     (
                    theTime
                    FOR theColumn IN (StartedAt, FinishedAt)
                ) AS u
)
SELECT      MIN(theTime),
            MAX(theTime)
FROM        cteGrouped
GROUP BY    grp
ORDER BY    grp ;
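The shape of that query is easier to see in miniature. The following Python sketch (mine, purely illustrative) walks the same three steps: join the finish sequence, offset by one row, against the start sequence to get the gap boundaries, append the overall MIN(StartedAt)/MAX(FinishedAt) row, then flatten, sort, and pair consecutive values to recover the islands:

```python
def islands_via_offset(tasks):
    """Mimic the cteSource/UNPIVOT query: gap boundary pairs plus the overall
    min/max row, flattened, sorted, and regrouped two-by-two into islands."""
    starts = sorted(s for s, f in tasks)
    finishes = sorted(f for s, f in tasks)
    # e.rn = s.rn with e's ROW_NUMBER offset by +1, keeping StartedAt > FinishedAt
    rows = [(s, f) for s, f in zip(starts[1:], finishes) if s > f]
    rows.append((min(starts), max(finishes)))  # the UNION ALL row
    # UNPIVOT: flatten every boundary time, then pair consecutive values
    times = sorted(t for row in rows for t in row)
    return [(times[i], times[i + 1]) for i in range(0, len(times), 2)]

tasks = [(3 * n, 3 * n + 2) for n in range(1, 6)]  # disjoint: (3,5)..(15,17)
print(islands_via_offset(tasks))  # [(3, 5), (6, 8), (9, 11), (12, 14), (15, 17)]
```

Each gap contributes one "end of previous island / start of next island" pair, so once the global endpoints are added, every consecutive pair of sorted boundary times is an island.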


Here is a solution which runs in 4 seconds.

WITH cteRaw(ts, type, e, s)
AS (
    SELECT      StartedAt,
                1 AS type,
                NULL,
                ROW_NUMBER() OVER (ORDER BY StartedAt)
    FROM        dbo.Tasks

    UNION ALL

    SELECT      FinishedAt,
                -1 AS type,
                ROW_NUMBER() OVER (ORDER BY FinishedAt),
                NULL
    FROM        dbo.Tasks
), cteCombined(ts, e, s, se)
AS (
    SELECT      ts,
                e,
                s,
                ROW_NUMBER() OVER (ORDER BY ts, type DESC)
    FROM        cteRaw
), cteFiltered(ts, grpnum)
AS (
    SELECT      ts,
                (ROW_NUMBER() OVER (ORDER BY ts) - 1) / 2 AS grpnum
    FROM        cteCombined
    WHERE       COALESCE(s + s - se - 1, se - e - e) = 0
)
SELECT      MIN(ts) AS starttime,
            MAX(ts) AS endtime
FROM        cteFiltered
GROUP BY    grpnum ;
puzsol (imported from SE)
I think I have exhausted the limits of my knowledge of SQL Server on this one...

For finding gaps in SQL Server (what the C# code does), assuming you don't care about leading or trailing gaps (those before the first start or after the last finish), the following query (or variants of it) is the fastest I could find:

SELECT  e.FinishedAt AS GapStart, s.StartedAt AS GapEnd
FROM    (
            SELECT  StartedAt, ROW_NUMBER() OVER (ORDER BY StartedAt) AS rn
            FROM    dbo.Tasks
        ) AS s
INNER JOIN
        (
            SELECT  FinishedAt, ROW_NUMBER() OVER (ORDER BY FinishedAt) + 1 AS rn
            FROM    dbo.Tasks
        ) AS e ON e.rn = s.rn AND s.StartedAt > e.FinishedAt

This works through sleight of hand: for each set of start-finish intervals, you can treat the starts and finishes as two separate ordered sequences, offset the finishes by one row, and the gaps fall out of the comparison.

E.g. take (S1, F1), (S2, F2), (S3, F3), and order the two sequences as {S1, S2, S3, null} and {null, F1, F2, F3}.
Then compare row n to row n in each set; the gaps are wherever the F value is less than the S value. The problem, I think, is that in SQL Server there is no way to join or compare two separate sets purely on the order of the values in each set - hence the use of the ROW_NUMBER function to allow a merge based purely on row number. But there is no way to tell SQL Server that those values are unique (short of inserting them into a table variable with an index on it, which takes longer - I tried it), so I think the merge join is less than optimal (though that's hard to prove when it's faster than anything else I could do).
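That row-for-row comparison of two offset sequences is a one-liner in any language with ordered sequences. A small Python sketch of the same idea (purely illustrative; the values are made up):

```python
starts = [1, 2, 8, 12]     # S1..S4, sorted
finishes = [3, 4, 10, 14]  # F1..F4, sorted

# Align row n of {S2, S3, S4} with row n of {F1, F2, F3}: a gap exists
# wherever the finish is less than the paired start.
gaps = [(f, s) for s, f in zip(starts[1:], finishes) if f < s]
print(gaps)  # [(4, 8), (10, 12)]
```

The `zip` is exactly the positional join SQL Server has no native construct for, which is why the query above has to manufacture row numbers first.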

I was able to get solutions using the LAG/LEAD functions:

SELECT  *
FROM    (
            SELECT  TOP (100) PERCENT
                    StartedAt, FinishedAt,
                    LEAD(StartedAt, 1, NULL) OVER (ORDER BY FinishedAt) AS NextStart
            FROM    dbo.Tasks
        ) AS x
WHERE   NextStart > FinishedAt

(Which, by the way, I don't guarantee the results of - it seems to work, but I think it relies on StartedAt being in order in the Tasks table... and it was slower.)

Using a running sum of the changes:

SELECT  *
FROM    (
            SELECT  EventTime, Change,
                    SUM(Change) OVER (ORDER BY EventTime, Change DESC
                                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RunTotal
            FROM    (
                        SELECT  StartedAt AS EventTime, 1 AS Change FROM dbo.Tasks
                        UNION ALL
                        SELECT  FinishedAt AS EventTime, -1 AS Change FROM dbo.Tasks
                    ) AS e
        ) AS x
WHERE   x.RunTotal = 0 OR (x.RunTotal = 1 AND x.Change = 1)
ORDER BY EventTime, Change DESC
(no surprise, also slower)
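The running-total filter is easier to see on a toy input. This Python sketch (illustrative) reproduces the SUM(Change) OVER logic and keeps exactly the rows the WHERE clause keeps - every island boundary, which can then be paired off:

```python
from itertools import accumulate

tasks = [(1, 3), (2, 4), (8, 10), (9, 12)]
events = sorted(
    [(s, 1) for s, f in tasks] + [(f, -1) for s, f in tasks],
    key=lambda e: (e[0], -e[1]),  # ORDER BY EventTime, Change DESC
)
run_totals = list(accumulate(change for _, change in events))

# Keep rows where the total hits 0 (an island ends) or a start event
# brings it to 1 (an island begins) - the WHERE clause above.
boundary = [
    (t, change) for (t, change), rt in zip(events, run_totals)
    if rt == 0 or (rt == 1 and change == 1)
]
print(boundary)  # [(1, 1), (4, -1), (8, 1), (12, -1)]

# Consecutive boundary pairs are the islands; the gaps sit between them.
islands = [(boundary[i][0], boundary[i + 1][0])
           for i in range(0, len(boundary), 2)]
print(islands)  # [(1, 4), (8, 12)]
```

Note the filter keeps the very first start and the very last finish too, which is what makes the two-by-two pairing line up.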

I even tried a CLR aggregate function (to replace the SUM - it was slower than SUM and relied on ROW_NUMBER() to keep the order of the data) and a CLR table-valued function (to open two result sets and compare values based purely on sequence)... and it too was slower. I banged my head so many times against SQL and CLR limitations trying many other methods...

And for what?

Running on the same machine, and spitting both the C# data and the SQL-filtered data into a file (as per the original C# code), the times are virtually the same: approximately 2 seconds for the 1-gap data (C# usually faster), and 8-10 seconds for the multi-gap data set (SQL usually faster).

**NOTE**: Do not use the SQL Server development environment for timing comparisons, as its display-to-grid rendering takes time. Tested with SQL Server 2012, VS2010, and the .NET 4.0 Client Profile.

I will point out that both solutions perform pretty much the same sorting of data on the SQL Server, so the server load for the fetch-and-sort will be similar whichever solution you use; the only differences are where the processing happens (client rather than server) and the transfer over the network.

I don't know what the difference might be when partitioning, by staff member perhaps, or when you need extra data along with the gap information (though I can't think of much besides a staff id), or of course if there is a **slow** data connection between the SQL Server and the client machine (or a **slow** client). Nor have I compared lock times, contention issues, or CPU/network load under multiple users, so I don't know which approach is more likely to become the bottleneck in those cases.

What I do know is: yes, SQL Server is not good at this sort of set comparison, and if you don't write the query right you will pay for it dearly.

Is it easier or harder than writing the C# version? I'm not entirely sure. The +/-1 Change, running-total solution is not entirely intuitive either, and I bet it's not the first solution an average graduate would come up with; once done it is easy enough to copy, but it takes insight to write in the first place. The same can be said of the SQL version.
Which is harder? Which is more robust to rogue data? Which has more potential for parallel operation? Does it really matter when the difference is so small compared to the programming effort?

One last note: there is an unstated constraint on the data - StartedAt **must** be less than FinishedAt, or you will get bad results.