How Selective Should An Index Be?

sql-server add tag

SE Anon or Wiki user (imported from SE)

Is there a general selectivity rule for when to apply a nonclustered index? 

We know not to create an index on a 50/50 `bit` column:

>[Should I index a bit field in SQL Server?][1]:
>
>Rows with 50/50 distribution, it might buy you very little performance gain

How selective should a SQL Server query be before an index should be applied?

Is there a general rule or guideline? At 25% average selectivity distribution in a column? 10% selectivity? 

This article by Tibor Karaszi is stating around 31%: [How selective do we need to be for an index to be used?][2]

  [1]: https://stackoverflow.com/questions/231125/should-i-index-a-bit-field-in-sql-server
  [2]: http://sqlblog.karaszi.com/how-selective-do-we-need-to-be-for-an-index-to-be-used/

Top Answer

meme (imported from SE)

Considering column selectivity only when deciding which columns to index ignores quite a bit of what indexes can do, and what they're generally good for.

For instance, you may have an identity or guid column that's incredibly selective -- unique, even -- but never gets used. In that case, who cares? Why index columns that queries don't touch?

Much less selective indexes, even `BIT` columns, can make useful, or useful parts of indexes. In some scenarios, very un-selective columns on large tables can benefit quite a bit from indexing when they need to be sorted on, or grouped by.

**Joins**

Take this query:

    SELECT COUNT(*) AS records
    FROM dbo.Users AS u
    JOIN dbo.Posts AS p
    ON u.Id = p.OwnerUserId;

Without a helpful index on `OwnerUserId`, this is our plan with a Hash Join -- which spills -- but that's secondary to the point.

[![NUTS][1]][1]
 
With a helpful index -- `CREATE INDEX ix_yourmom ON dbo.Posts (OwnerUserId);` -- our plan changes.

[![NUTS][2]][2]

**Aggregates**

Likewise, grouping operations can benefit from indexing.

    SELECT   p.OwnerUserId, COUNT(*) AS records
    FROM     dbo.Posts AS p
    GROUP BY p.OwnerUserId;

Without an index:

[![NUTS][3]][3]

With an index:

[![NUTS][4]][4]

**Sorts**

Sorting data can be another sticking point in queries that indexes can help.

Without an index:

[![NUTS][5]][5]

With our index:

[![NUTS][6]][6]

**Blocking**

Indexes can also help avoid blocking pile-ups. 

If we try to run this update:

    UPDATE p
    SET p.Score += 100
    FROM dbo.Posts AS p
    WHERE p.OwnerUserId = 22656;

And concurrently run this select:

    SELECT *
    FROM   dbo.Posts AS p
    WHERE  p.OwnerUserId = 8;

They'll end up blocking:

[![NUTS][7]][7]

With our index in place, the select finishes instantly without being blocked. SQL Server has a way to access the data it needs efficiently.

In case you're wondering (using the equation Kumar provided) the OwnerUserId column's selectivity is `0.0701539878296839478`

**Wrap it up**

Don't just blindly index columns based on how selective they are. Design indexes that help your workload run efficiently. Using more selective columns as leading key columns is generally a good idea when you're searching for equality predicates, but can be less helpful when searching on ranges. 

  [1]: https://i.stack.imgur.com/09wP6.jpg
  [2]: https://i.stack.imgur.com/MrkF7.jpg
  [3]: https://i.stack.imgur.com/Puo9c.jpg
  [4]: https://i.stack.imgur.com/Hs9oA.jpg
  [5]: https://i.stack.imgur.com/chlXI.jpg
  [6]: https://i.stack.imgur.com/YvXT8.jpg
  [7]: https://i.stack.imgur.com/VI6nw.jpg

Answer #2

Tom V (imported from SE)

What you refer to is the "tipping point" at which the SQL Server optimizer decides to go with a table scan instead of an index seek and key lookup.

There are some things to note about that, the tipping point obviously only affects non-clustered indexes (as the key lookup doesn't have to occur when using a clustered index) and the tipping point also doesn't come into play when your non-clustered index is covering (either all selected columns are in the key columns or in the included columns of the index).

That being said, the number of rows is *not* 30% of the total rows. It's not a fixed value.

The number of rows is somewhere between 25% and 33% percent of the number of pages, so unless you have 1 row per page the percentage of rows is much much smaller.

See the examples by Kimberly Tripp in [The Tipping Point Query Answers](https://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/)

>  - If a table has 500,000 pages then 25% = 125,000 and 33% = 166,000. So, somewhere between 125,000 and 166,000 ROWS the query will tip.
> Turning that into a percentage 125,000/1million = 12.5% and
> 166,000/1million = 16.6%. So, if a table has 500,000 pages (and 1
> million rows) then queries that return less than 12.5% of the data are
> likely to USE the nonclustered index to lookup the data and queries
> over 16.6% of the data are LIKELY to use a table scan.
> 
>  - If a table has 10,000 pages then 25% = 2,500 and 33% = 3,333. So, somewhere between 2,500 and 3,333 ROWS the query will tip. Turning
> that into a percentage 2,500/1million = .25% and 3,333/1million = .33%
> (not even 1%). So, if a table has only 10,000 pages (and 1 million
> rows) then queries that return less than a quarter of 1% of the data
> are likely to USE the nonclustered index to lookup the data and
> queries over one third of one percent are LIKELY to use a table scan. 
> 
>  - If a table has 50,000 pages then 25% = 12,500 and 33% = 16,666. So,
> somewhere between 12,500 and 16,666 ROWS the query will tip. Turning
> that into a percentage 12,500/1million = 1.25% and 16,666/1million =
> 1.66% (under 2%). So, if a table has 50,000 pages (and 1 million rows) then queries that return less than 1.25% of the data are likely to USE
> the nonclustered index to lookup the data and queries over 1.66% are
> LIKELY to use a table scan.

Now to answer your question, how selective should it be before an index is used?

Depending on your row size and number of pages it could be *very* selective. If you want to make sure your index is used you want your index to be covering.

2 Answers