For a moderately complex query I am trying to optimize, I noticed that removing the `TOP n` clause changes the execution plan. I would have guessed that when a query includes `TOP n` the database engine would run the query ignoring the the `TOP` clause, and then at the end just shrink that result set down to the *n* number of rows that was requested. The graphical execution plan seems to indicate this is the case -- `TOP` is the "last" step. But it appears there is more going on.
**My question is, how (and why) does a TOP n clause impact the execution plan of a query?**
Here is a simplified version of what is going on in my case:
The query is matching rows from two tables, A and B.
Without the `TOP` clause, the optimizer estimates there will be 19k rows from table A and 46k rows from table B. The actual number of rows returned is 16k for A and 13k for B. A hash match is used to join these two results sets for a total of 69 rows (then a sort is applied). This query happens very quickly.
When I add `TOP 1001` the optimizer does not use a hash match; instead it first sorts the results from table A (same estimate/actual of 19k/16k) and does a nested loop against table B. The estimated number of rows for table B is now 1, and the strange thing is that the `TOP n` directly affects the estimated number of executions (index seek) against B -- it appears to always be *2n+1*, or in my case 2003. This estimate changes accordingly if I change `TOP n`. Of course, since this is a nested join the actual number of executions is 16k (the number of rows from table A) and this slows down the query.
The actual scenario is a bit more complex but this captures the basic idea/behavior. Both tables are searched using index seeks. This is SQL Server 2008 R2 Enterprise edition.
> *I would have guessed that when a query includes TOP n the database
> engine would run the query ignoring the the TOP clause, and then at
> the end just shrink that result set down to the n number of rows that
> was requested. The graphical execution plan seems to indicate this is
> the case -- TOP is the "last" step. But it appears there is more going
The way the above is phrased makes me think you may have an incorrect mental picture of how a query executes. An operator in a query plan is not a *step* (where the full result set of a previous *step* is evaluated by the next one.
SQL Server uses a *pipelined* execution model, where each operator exposes methods like *Init()*, *GetRow()*, and *Close()*. As the *GetRow()* name suggests, an operator produces one row at a time on demand (as required by its parent operator). This is documented in the Books Online [Logical and Physical Operators reference], with more detail in my blog post [Why Query Plans Run Backwards]. This row-at-a-time model is essential in forming a sound intuition for query execution.
> My question is, how (and why) does a `TOP` n clause impact the execution
> plan of a query?
Some logical operations like `TOP`, semi joins and the `FAST n` [query hint] affect the way the query optimizer costs execution plan alternatives. The basic idea is that one possible plan shape might return the first *n* rows more quickly than a different plan that was optimized to return all rows.
For example, indexed nested loops join is often the fastest way to return a small number of rows, though hash or merge join with scans might be more efficient on larger sets. The way the query optimizer reasons about these choices is by setting a [Row Goal] at a particular point in the logical tree of operations.
A row goal modifies the way query plan alternatives are costed. The essence of it is that the optimizer starts by costing each operator as if the full result set were required, sets a row goal at the appropriate point, and then works back down the plan tree estimating the number of rows it expects to need to examine to meet the row goal.
For example, a logical `TOP(10)` sets a row goal of 10 at a particular point in the logical query tree. The costs of operators leading up to the row goal are modified to estimate how many rows they need to produce to meet the row goal. This calculation can become complex, so it is easier to understand all this with a [fully worked example] and annotated execution plans. Row goals can affect more than the choice of join type or whether seeks and lookups are preferred to scans. More details on that [here].
As always, an execution plan selected on the basis of a row goal is subject to the optimizer's reasoning abilities and the quality of information provided to it. Not every plan with a row goal will produce the required number of rows faster in practice, but according to the costing model it will.
Where a row goal plan proves not to be faster, there are usually ways to modify the query or provide better information to the optimizer such that the naturally selected plan is best. Which option is appropriate in your case depends on the details of course. The row goal feature is generally very effective (though there is a [bug] to watch out for when used in parallel execution plans).
Your particular query and plan may not be suitable for detailed analysis here (by all means provide an actual execution plan if you wish) but hopefully the ideas outlined here will allow you to make forward progress.