What are the benefits of not having a clustered index on a table in SQL Server?
Will:
SELECT * INTO TABLE_A FROM TABLE_B
Be faster if `TABLE_A` is a heap?
Which operation(s) will benefit if the table is a heap?
I am quite sure `UPDATE`s and `DELETE`s will benefit from a clustered index. What about `INSERT`s? My understanding is that `INSERT` "might" benefit from the table being a heap, both in term of speed but also other resources and hardware (I/O, CPU, memory and storage...).
What is the most scarce resource in terms of hardware? In terms of storage is a heap going to occupy less space? Is disk storage not the least expensive resource? If so is it rational to keep table as heap in order to save disk space? How will a heap affect CPU and I/O with `SELECT`, `INSERT`, `UPDATE` and `DELETE`? What cost goes up when table is a heap and we `SELECT`, `UPDATE` and `DELETE` from it?
I'm only addressing the question in the title. The question body has too many different questions to be answered in a Top Answer. Heaps have the following advantages over tables with clustered indexes:
1. Tables with clustered indexes physically organize the data in order of those indexes. The SQL Server query optimizer may add a sort to a query plan for large enough inserts if the data to be inserted isn't already sorted in clustered key order. The additional work performed by the sort is not free and the query can run slower as a result. Inserts into non-partitioned heaps do not need sorted source data.
2. The rules for minimal logging work differently between heaps and clustered indexes. In some scenarios, an insert into a heap will write fewer bytes to the transaction log than a comparable insert into a table with a clustered index. Reference the [Data Loading Performance Guide][1] for some examples.
3. A clustered index prevents parallel insert into a table. The insert part of the query plan will always be located in a serial zone (DOP = 1). Starting with SQL Server 2016, an insert into a heap with no nonclustered indexes can qualify for parallel insert.
4. A clustered index will add lock and latch contention when many separate processes are inserting into the same table. Heaps with no nonclustered indexes can be loaded by many processes concurrently with less contention, especially when those inserts are sourced outside of SQL Server with Bulk Update locks.
[1]: https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008/dd425070(v=sql.100)
Answer #2
Jack Douglas
It is possible for clustering of any kind to affect `INSERT` performance adversely.
> I am quite sure UPDATEs and DELETEs will benefit from a clustered index. What about INSERTs? My understanding is that INSERT “might” benefit from the table being a heap, both in term of speed but also other resources and hardware (I/O, CPU, memory and storage…).
If you are inserting (say) 5k rows into a big enough table, in the non-clustered case the inserted rows are all going to bunch together into a small number of blocks. This can also be true if they happen to already be clustered like the existing table data, e.g. if the key is time-based or an incrementing sequence.
On the other hand, if the new rows are dispersed randomly throughout the existing clustered rows (e.g. if they are a random uuid), it can require a write to a different leaf block *per row*.
Of course any non-clustered indexes could muddy the water here as they are also intrinsically 'clustered' from an IO perspective, but the point is that yes, "*INSERT “might” benefit from the table being a heap*". And the difference could be several orders of magnitude:
<>https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=233ade272763d764ffe7a3a837475d0c&hide=63
>Which operation will benefit if the table is a heap?
A heap is the second-least expensive table design to load and read, after a non-durable memory-optimized table. So it's well suited to use as a staging table.
For very large tables, a clustered columnstore is much better than a heap. Loading is more expensive (especially if you don't bulk load), but it's much more space-efficient, and reading is much, much faster.
Other than that your tables should typcially have a clustered primary key.