Paul White (imported from SE)
Given the following heap table with 400 rows numbered from 1 to 400:


```sql
DROP TABLE IF EXISTS dbo.N;
GO
SELECT
    SV.number
INTO dbo.N
FROM master.dbo.spt_values AS SV
WHERE
    SV.[type] = N'P'
    AND SV.number BETWEEN 1 AND 400;
```


and the following settings:


```sql
SET NOCOUNT ON;
SET STATISTICS IO, TIME OFF;
SET STATISTICS XML OFF;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
```


The following SELECT statement completes in around **6 seconds** ([demo][1], [plan][2]):


```sql
DECLARE @n integer = 400;

SELECT
    c = COUNT_BIG(*)
FROM dbo.N AS N
CROSS JOIN dbo.N AS N2
CROSS JOIN dbo.N AS N3
WHERE
    N.number <= @n
    AND N2.number <= @n
    AND N3.number <= @n
OPTION
    (OPTIMIZE FOR (@n = 1));
```


*Note: The OPTIMIZE FOR clause is used only to produce a sensibly-sized repro that captures the essential details of the real problem, including a cardinality misestimate that can arise for a variety of reasons.*

When the single-row output is written to a table, it takes **19 seconds** ([demo][3], [plan][4]):


```sql
DECLARE @T table (c bigint NOT NULL);

DECLARE @n integer = 400;

INSERT @T
    (c)
SELECT
    c = COUNT_BIG(*)
FROM dbo.N AS N
CROSS JOIN dbo.N AS N2
CROSS JOIN dbo.N AS N3
WHERE
    N.number <= @n
    AND N2.number <= @n
    AND N3.number <= @n
OPTION
    (OPTIMIZE FOR (@n = 1));
```


The execution plans appear identical, aside from the insert of the one row.

All the extra time seems to be consumed by CPU usage.

Why is the INSERT statement so much slower?

[1]: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=3a1563d5014e69c90fc13e4fa33459d7
[2]: https://www.brentozar.com/pastetheplan/?id=ByCrEMbIE
[3]: https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=35e46790034e0345c64ea2ca6fa78570
[4]: https://www.brentozar.com/pastetheplan/?id=HJLT4fb84
Paul White (imported from SE)
SQL Server chooses to scan the heap tables on the inner side of the nested loops joins using row-level locks. A full scan would normally use page-level locking, but the combination of the table's size and the predicate leads the storage engine to choose row locks, since that appears to be the cheapest strategy.

The cardinality misestimate deliberately introduced by OPTIMIZE FOR means that the heaps are scanned *many* more times than the optimizer expects, and the optimizer does not introduce a spool as it normally would.

This combination of factors means that performance is very sensitive to the number of locks required at runtime.

The SELECT statement benefits from an optimization that allows [row-level shared locks to be skipped][1] (taking only intent-shared page-level locks) when there is no danger of reading uncommitted data, and there is no off-row data.

The INSERT...SELECT statement does not benefit from this optimization, so in the second case millions of RID locks are taken and released each second, in addition to the intent-shared page-level locks.

The enormous amount of locking activity accounts for the extra CPU and elapsed time.

The most natural workaround is to ensure the optimizer (and storage engine) get decent cardinality estimates so they can make good choices.
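In this repro, for instance, one way to do that is to let the query compile with the actual parameter value instead of the artificially low one (a sketch; `OPTION (RECOMPILE)` is one of several ways to obtain an accurate estimate here):

```sql
DECLARE @n integer = 400;

SELECT
    c = COUNT_BIG(*)
FROM dbo.N AS N
CROSS JOIN dbo.N AS N2
CROSS JOIN dbo.N AS N3
WHERE
    N.number <= @n
    AND N2.number <= @n
    AND N3.number <= @n
OPTION
    (RECOMPILE); -- compile using the runtime value of @n, rather than OPTIMIZE FOR (@n = 1)
```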

If that is not practical in the real use case, the INSERT and SELECT statements can be separated, with the result of the SELECT captured in a variable. This allows the SELECT statement to benefit from the lock-skipping optimization.
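A sketch of that split, reusing the names from the question's repro:

```sql
DECLARE @T table (c bigint NOT NULL);
DECLARE @n integer = 400;
DECLARE @c bigint;

-- The standalone SELECT qualifies for the lock-skipping optimization.
SELECT
    @c = COUNT_BIG(*)
FROM dbo.N AS N
CROSS JOIN dbo.N AS N2
CROSS JOIN dbo.N AS N3
WHERE
    N.number <= @n
    AND N2.number <= @n
    AND N3.number <= @n
OPTION
    (OPTIMIZE FOR (@n = 1));

-- Inserting the single captured value is then trivial.
INSERT @T (c) VALUES (@c);
```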

Changing the isolation level can also be made to work, either by not taking shared locks at all, or by ensuring that lock escalation takes place quickly.
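For example (sketches only; whether either option is acceptable depends on the real workload's concurrency and consistency requirements):

```sql
-- Option 1: take no shared locks at all.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

-- Option 2: hold shared locks to the end of the transaction instead of
-- releasing them row by row, so the lock count climbs quickly and
-- escalation to a table-level lock kicks in early.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
```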

As a final point of interest, the query can be made to run even faster than the optimized SELECT case by forcing the use of spools using undocumented trace flag 8691.
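For reference, the hint form of that experiment looks like the following (QUERYTRACEON 8691 is undocumented and unsupported, so this is for investigation only, never production):

```sql
DECLARE @T table (c bigint NOT NULL);
DECLARE @n integer = 400;

INSERT @T
    (c)
SELECT
    c = COUNT_BIG(*)
FROM dbo.N AS N
CROSS JOIN dbo.N AS N2
CROSS JOIN dbo.N AS N3
WHERE
    N.number <= @n
    AND N2.number <= @n
    AND N3.number <= @n
OPTION
    (OPTIMIZE FOR (@n = 1), QUERYTRACEON 8691); -- force spool insertion
```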

[1]: https://www.sql.kiwi/2010/11/read-committed-shared-locks-and-rollbacks.html