I'm running into quite a strange problem. I'm running the same script (it generates some data and then does some matching on it) on an older 2008 R2 instance and on a newer 2017 instance. On 2008 R2, the last query (an `UPDATE`) does a single table scan and returns all of the 250,000 rows. On the 2017 instance, the table is scanned in parallel, each of the 4 threads reads 250,000 rows, and the plan reports 1 million "actual rows read". I changed the database compatibility level to 2008 on my 2017 instance and the actuals stayed the same, at 1,000,000.

Is there any valid reason why this would happen, or does this seem like it should be a Connect item?

The plans contain the same operators, but one of them does the scan in parallel and, instead of splitting the 250,000 rows across the 4 threads (reading only 62,500 rows on each thread), every thread reads 250,000 rows: 250,000 * 4 = 1,000,000.

Both execution plans can be found at Paste The Plan:

- [**2017 version here**][1]
- [**2008R2 version here**][2]

The full script I was running is below:

    create table #targets
    (
        id int identity(1,1),
        start_point int,
        end_point int,
        refference_type_id int,
        bla1 int, bla2 int, bla3 int, bla4 int,
        bla5 int, bla6 int, bla7 int, bla8 int,
        bla9 bit,
        assignedTouch varchar(10)
    );

    ;with cte as
    (
        select 1 sp
            , abs(checksum(newid())) % 11 + 2 ep
            , 1 rn
            , abs(checksum(newid())) % 3 b
            , abs(checksum(newid())) % 3 c
            , abs(checksum(newid())) % 3 d
            , abs(checksum(newid())) % 3 e
            , abs(checksum(newid())) % 3 f
            , abs(checksum(newid())) % 3 g
            , abs(checksum(newid())) % 3 g2
            , abs(checksum(newid())) % 3 h
            , abs(checksum(newid())) % 3 x3
        union all
        select sp
            , abs(checksum(newid())) % 11 + 2 ep
            , rn + 1
            , abs(checksum(newid())) % 8
            , abs(checksum(newid())) % 8
            , abs(checksum(newid())) % 8
            , abs(checksum(newid())) % 8
            , abs(checksum(newid())) % 8
            , abs(checksum(newid())) % 8
            , abs(checksum(newid())) % 8
            , abs(checksum(newid())) % 8
            , abs(checksum(newid())) % 8
        from cte
        where rn < 250000
    )
    insert into #targets
    select *
        , 'Unassigned' [Default State]
    from cte
    option (maxrecursion 0);

    select top 250 *
        , char(abs(checksum(newid())) % 85 + 65) [class]
    into #matching
    from #targets
    where end_point in (11, 14, 22, 33)
    order by abs(checksum(newid())) % 13;

    update #matching
    set bla8 = case when bla1 - bla2 > 0 then NULL else bla8 end
      , bla7 = case when bla2 - bla3 > 0 then NULL else bla7 end
      , bla6 = case when bla4 - bla5 > 0 then NULL else bla6 end;

    create nonclustered index nc_assignedTouch on #targets (assignedTouch);

    update t
    set assignedTouch = m.class
    from #targets t
    inner join #matching m
        on  t.start_point = isnull(m.start_point, t.start_point)
        and t.bla1 = isnull(m.bla1, t.bla1)
        and t.bla2 = isnull(m.bla2, t.bla2)
        and t.bla3 = isnull(m.bla3, t.bla3)
        and t.bla4 = isnull(m.bla4, t.bla4)
        and t.bla5 = isnull(m.bla5, t.bla5)
        and t.bla6 = isnull(m.bla6, t.bla6)
        and t.bla7 = isnull(m.bla7, t.bla7)
        and t.bla8 = isnull(m.bla8, t.bla8)
        and t.bla9 = isnull(m.bla9, t.bla9)
    where t.assignedTouch = 'Unassigned';

sp_configure info:

[![enter image description here][3]][3]

  [1]: https://www.brentozar.com/pastetheplan/?id=SkSy66IPX
  [2]: https://www.brentozar.com/pastetheplan/?id=rJ3LA6UDQ
  [3]: https://i.stack.imgur.com/FpvSA.png
## Parallel heap scan

You might be expecting the work to be distributed among the parallel threads, as in the following toy example:

    SELECT TOP (5 * 1000 * 1000)
        n = ROW_NUMBER() OVER (ORDER BY @@SPID)
    INTO #n
    FROM sys.columns AS C
    CROSS JOIN sys.columns AS C2
    CROSS JOIN sys.columns AS C3;

    SELECT COUNT_BIG(*)
    FROM #n AS N
    GROUP BY N.n % 10
    OPTION (USE HINT ('ENABLE_PARALLEL_PLAN_PREFERENCE'));

[![parallel heap scan][1]][1]

In that plan, the heap table is indeed scanned in parallel, with all threads coordinating to share the work of reading the whole table once:

[![Plan Explorer thread summary][2]][2]

or in the SSMS view:

[![SSMS thread summary][3]][3]

## Your case

That is not the arrangement in your uploaded plan:

[![uploaded plan][4]][4]

The heap scan is on the **inner side of a nested loops join**[^fn1], so each thread runs a serial copy of the inner side, meaning there are DOP (degree of parallelism) independent copies of the Table Spool and Table Scan. At DOP four, that means *four spools* and *four scans*, as evidenced by `Number of Executions = 4` on the table scan. The heap table really is fully scanned four times (once per thread), giving 250,000 * 4 = 1,000,000 rows. The lazy spool caches the result of the scan per thread.

So the difference is that your parallel scan is **four serial scans running in parallel**, rather than four threads **co-operating to parallel scan** the heap once (as in the toy example above). It can be challenging to conceptualize the difference, but it is crucial. Once you see the branch between the two exchanges as DOP separate serial plans, it becomes easier to decode.

Of course the plan is wildly inefficient, and the spool adds little value. Notice that the join predicate is applied at the nested loops join, rather than being pushed down the inner side (which would make the join an apply). This is a consequence of the complex join conditions involving `ISNULL`.

You might get a slightly better-performing plan by making your `nc_assignedTouch` index clustered rather than nonclustered, but the bulk of the work would still happen at the join, so the improvement would almost certainly be minimal. A query rewrite is probably necessary here; a rough, untested sketch of both ideas follows at the end of this answer. Ask a follow-up question if you want assistance expressing the query in a more execution-friendly way.

For more background on the parallelism aspects, see my article [Understanding and Using Parallelism in SQL Server][5].

[^fn1]: There is one general exception, where one can see a true co-operative parallel scan (and exchanges) on the inner side of a nested loops join: the outer input must be *guaranteed* to produce at most one row, and the loop join cannot have any correlated parameters (outer references). Under these conditions, the execution engine allows parallelism on the inner side because it will always produce correct results. You may also notice that operators below an Eager Spool on the inner side of a parallel nested loops join execute only once. There are still DOP copies of these operators, but the runtime behaviour is that only one thread builds a shared indexed structure, which is then used by all instances of the Eager Spool. I do apologise that all this is so complicated.

  [1]: https://i.stack.imgur.com/uDpkh.png
  [2]: https://i.stack.imgur.com/CD1kW.png
  [3]: https://i.stack.imgur.com/b0AsR.png
  [4]: https://i.stack.imgur.com/wvf7j.png
  [5]: https://www.red-gate.com/simple-talk/sql/learn-sql-server/understanding-and-using-parallelism-in-sql-server/
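As the rough, untested sketch promised above (illustrative only, not a verified fix): the index could be created as clustered instead of nonclustered, and the `ISNULL` conditions could be written as explicit "NULL means match anything" predicates. The rewrite is only equivalent because your generation script never puts NULLs into the `#targets` columns, and the index name `cx_assignedTouch` is just a placeholder.

    -- In place of the original nonclustered index: a clustered index
    -- (hypothetical name), so the update reads a clustered index
    -- rather than a heap.
    create clustered index cx_assignedTouch on #targets (assignedTouch);

    -- Same matching logic, written so a NULL in a #matching column
    -- explicitly means "match any value". Only equivalent to the
    -- ISNULL form because the #targets columns are never NULL here.
    update t
    set assignedTouch = m.class
    from #targets t
    inner join #matching m
        on  (m.start_point is null or m.start_point = t.start_point)
        and (m.bla1 is null or m.bla1 = t.bla1)
        and (m.bla2 is null or m.bla2 = t.bla2)
        and (m.bla3 is null or m.bla3 = t.bla3)
        and (m.bla4 is null or m.bla4 = t.bla4)
        and (m.bla5 is null or m.bla5 = t.bla5)
        and (m.bla6 is null or m.bla6 = t.bla6)
        and (m.bla7 is null or m.bla7 = t.bla7)
        and (m.bla8 is null or m.bla8 = t.bla8)
        and (m.bla9 is null or m.bla9 = t.bla9)
    where t.assignedTouch = 'Unassigned';

Whether either change actually helps on your data would need testing; they are shown only to make the suggestions above concrete.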