## Setup
DROP TABLE IF EXISTS #EmptyTable, #BigTable
CREATE TABLE #EmptyTable(A int);
CREATE TABLE #BigTable(A int);
INSERT INTO #BigTable
SELECT TOP 10000000 CRYPT_GEN_RANDOM(3)
FROM sys.all_objects o1,
sys.all_objects o2,
sys.all_objects o3;
## Query
WITH agg
AS (SELECT DISTINCT a
FROM #BigTable)
SELECT *
FROM #EmptyTable E
INNER HASH JOIN agg B
ON B.A = E.A;
## Execution Plan
[![enter image description here][1]][1]
## Problem
This is a simplified repro for a phenomenon I hadn't noticed before today. My expectation for an inner hash join would be that if the build input is empty the probe side should not be executed as the join can return no rows. The above example contradicts that and reads the 10 million rows from the table. This adds 2.196 seconds to the execution time of the query (99.9%).
## Additional Observations
1. With `OPTION (MAXDOP 1)` the execution plan reads no rows from `#BigTable`. The `ActualExecutions` is `0` for all operators on the inside of the hash join.
2. For the query `SELECT * FROM #EmptyTable E INNER HASH JOIN #BigTable B ON B.A = E.A`- I get a parallel plan, the scan operator on the inside of the hash join does have `ActualExecutions` of DOP but still no rows are read. This plan has no repartition streams operator (or aggregate)
## Question
What's going on here? Why does the original plan exhibit the problem and the other cases don't?
[1]: https://i.stack.imgur.com/XC3O9.png
Top Answer
Paul White
Not running the probe side of the join when the build is empty is an optimization. It is not available for *parallel row mode hash join* when the probe side has a child branch i.e. when there is an exchange operator.
There was a similar report many years ago by Adam Machanic on the now-defunct Connect feedback site. The scenario was a start-up Filter on the probe side, which ran its child operators unexpectedly. The answer from Microsoft was that the engine requires a guarantee that certain structures are initialized, and the only sane way to enforce that was to ensure the probe side operators are opened.
My own recollection of the details is that not initializing a sub-tree led to hard-to-fix parallel timing bugs. Ensuring the child branch started up was a work around for those problems.
Batch mode hash join does not have this side effect because the way threads are managed is different.
In your particular case, the effect is more pronounced because the hash aggregate is blocking; it consumes its entire input during the iterator's Open() call. When there are only streaming operators on the probe side, the performance impact will often be more limited, depending on how much work is required to return the first row to the probe side of the hash join.