SE Anon or Wiki user (imported from SE)
Is there a general selectivity rule for when to apply a nonclustered index?
We know not to create an index on a 50/50 `bit` column:
>[Should I index a bit field in SQL Server?][1]:
>
>Rows with 50/50 distribution, it might buy you very little performance gain
How selective should a SQL Server query be before an index should be applied?
Is there a general rule or guideline? At 25% average selectivity distribution in a column? 10% selectivity?
This article by Tibor Karaszi is stating around 31%: [How selective do we need to be for an index to be used?][2]
[1]: https://stackoverflow.com/questions/231125/should-i-index-a-bit-field-in-sql-server
[2]: http://sqlblog.karaszi.com/how-selective-do-we-need-to-be-for-an-index-to-be-used/
Considering column selectivity only when deciding which columns to index ignores quite a bit of what indexes can do, and what they're generally good for.
For instance, you may have an identity or guid column that's incredibly selective -- unique, even -- but never gets used. In that case, who cares? Why index columns that queries don't touch?
Much less selective indexes, even `BIT` columns, can make useful, or useful parts of indexes. In some scenarios, very un-selective columns on large tables can benefit quite a bit from indexing when they need to be sorted on, or grouped by.
**Joins**
Take this query:
SELECT COUNT(*) AS records
FROM dbo.Users AS u
JOIN dbo.Posts AS p
ON u.Id = p.OwnerUserId;
Without a helpful index on `OwnerUserId`, this is our plan with a Hash Join -- which spills -- but that's secondary to the point.
[![NUTS][1]][1]
With a helpful index -- `CREATE INDEX ix_yourmom ON dbo.Posts (OwnerUserId);` -- our plan changes.
[![NUTS][2]][2]
**Aggregates**
Likewise, grouping operations can benefit from indexing.
SELECT p.OwnerUserId, COUNT(*) AS records
FROM dbo.Posts AS p
GROUP BY p.OwnerUserId;
Without an index:
[![NUTS][3]][3]
With an index:
[![NUTS][4]][4]
**Sorts**
Sorting data can be another sticking point in queries that indexes can help.
Without an index:
[![NUTS][5]][5]
With our index:
[![NUTS][6]][6]
**Blocking**
Indexes can also help avoid blocking pile-ups.
If we try to run this update:
UPDATE p
SET p.Score += 100
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 22656;
And concurrently run this select:
SELECT *
FROM dbo.Posts AS p
WHERE p.OwnerUserId = 8;
They'll end up blocking:
[![NUTS][7]][7]
With our index in place, the select finishes instantly without being blocked. SQL Server has a way to access the data it needs efficiently.
In case you're wondering (using the equation Kumar provided) the OwnerUserId column's selectivity is `0.0701539878296839478`
**Wrap it up**
Don't just blindly index columns based on how selective they are. Design indexes that help your workload run efficiently. Using more selective columns as leading key columns is generally a good idea when you're searching for equality predicates, but can be less helpful when searching on ranges.
[1]: https://i.stack.imgur.com/09wP6.jpg
[2]: https://i.stack.imgur.com/MrkF7.jpg
[3]: https://i.stack.imgur.com/Puo9c.jpg
[4]: https://i.stack.imgur.com/Hs9oA.jpg
[5]: https://i.stack.imgur.com/chlXI.jpg
[6]: https://i.stack.imgur.com/YvXT8.jpg
[7]: https://i.stack.imgur.com/VI6nw.jpg
What you refer to is the "tipping point" at which the SQL Server optimizer decides to go with a table scan instead of an index seek and key lookup.
There are some things to note about that, the tipping point obviously only affects non-clustered indexes (as the key lookup doesn't have to occur when using a clustered index) and the tipping point also doesn't come into play when your non-clustered index is covering (either all selected columns are in the key columns or in the included columns of the index).
That being said, the number of rows is *not* 30% of the total rows. It's not a fixed value.
The number of rows is somewhere between 25% and 33% percent of the number of pages, so unless you have 1 row per page the percentage of rows is much much smaller.
See the examples by Kimberly Tripp in [The Tipping Point Query Answers](https://www.sqlskills.com/blogs/kimberly/the-tipping-point-query-answers/)
> - If a table has 500,000 pages then 25% = 125,000 and 33% = 166,000. So, somewhere between 125,000 and 166,000 ROWS the query will tip.
> Turning that into a percentage 125,000/1million = 12.5% and
> 166,000/1million = 16.6%. So, if a table has 500,000 pages (and 1
> million rows) then queries that return less than 12.5% of the data are
> likely to USE the nonclustered index to lookup the data and queries
> over 16.6% of the data are LIKELY to use a table scan.
>
> - If a table has 10,000 pages then 25% = 2,500 and 33% = 3,333. So, somewhere between 2,500 and 3,333 ROWS the query will tip. Turning
> that into a percentage 2,500/1million = .25% and 3,333/1million = .33%
> (not even 1%). So, if a table has only 10,000 pages (and 1 million
> rows) then queries that return less than a quarter of 1% of the data
> are likely to USE the nonclustered index to lookup the data and
> queries over one third of one percent are LIKELY to use a table scan.
>
> - If a table has 50,000 pages then 25% = 12,500 and 33% = 16,666. So,
> somewhere between 12,500 and 16,666 ROWS the query will tip. Turning
> that into a percentage 12,500/1million = 1.25% and 16,666/1million =
> 1.66% (under 2%). So, if a table has 50,000 pages (and 1 million rows) then queries that return less than 1.25% of the data are likely to USE
> the nonclustered index to lookup the data and queries over 1.66% are
> LIKELY to use a table scan.
Now to answer your question, how selective should it be before an index is used?
Depending on your row size and number of pages it could be *very* selective. If you want to make sure your index is used you want your index to be covering.