Dirty read or phantom row - Databases

TopAnswers Databases

Meta

Databases

TeX

Code Golf

APL

C++

.net

db<>fiddle

Java

*nix

PHP

PowerShell

Python

Rust

टेक्-मराठी

Typst

Web Client Dev

Web Server Dev

Dirty read or phantom row

sql-server add tag

Niels Broertjes (imported from SE)

We are having dirty reads or phantom rows in our data warehouse. My question is not how to solve it (it is a long story and I am currently working on it), but I want to know what exactly happens to better understand the behaviour and use the correct terminology. Are these dirty reads or phantom rows?

Let's say we have `SourceTableA` in the ERP database with a `CLUSTERED UNIQUE` index on `Col1` and `Col2`. The DWH extraction process that performs the following query:

```
SELECT Col1, Col2, Col3
FROM ERPdatabase..SourceTableA WITH(NOLOCK);
```

This is not recommended, because now we can even have dirty reads, but we are forced to do this by the ERP vendor. Long story, so never mind.

Say we have `RowA` in `SourceTableA`:

```
('GreatCompany','WonderfullItem','ColorRed')
```

Now (I think...) a user session in the ERP system performs an update on `RowA` while the previous query is running like this:

```
UPDATE SourceTableA
SET Col3 = 'ColorBlue'
WHERE Col1 = 'GreatCompany'
AND Col2 = 'WonderfullItem'
```

If I am correct or not, what I see happening is that 2 rows are being read by the extraction process. I am not surprised since we use `NOLOCK` (although I am unsure if I understand the exact cause, hence this question), but I have 2 questions about this:

1. Since the row is not moved around in the CLUSTERED index, because the primary key does not change, why is it read a second time? My thinking would be that the extract process is already past that row, so it should not encounter it anymore. Or is that not how it works? The only explanation I can think of is that the extract query is not using the CLUSTERED index to read the data, but maybe a NONCLUSTERED index where the row IS moved around. Or what else am I missing? Or is it just a dirty read? Or is it just 'SQL Server does not guarantee row order unless you use an ORDER BY, DISTINCT or do not use NOLOCK'? Then I am curious why exactly. :-)

2. Regarding dirty reads, I always thought the behaviour I see were phantom rows, but, to my understanding, for a phantom row to occur we actually need 2 statements within the same transaction that read the same set of data. That is not the case here since it is only 1 statement, so implicitly I think that this should then be a dirty read? On the other hand, for a dirty read to occur we need an update to be rolled back by another transaction. That is not what I assume is happening. Or is that actually the only explanation?

Top Answer

Tibor Karaszi (imported from SE)

1. You are correct in your initial assumptions here. If the row don't move (including due to a page split), then a scan/seek won't read it twice. But, as you say, perhaps the query is done through an NC index for which the row index *was* moved, hence the double read.

2. Regarding terminology, there are some standard phenomena defined in ANSI SQL, which are then used in conjunction with isolation level to describe the isolation levels. I.e., isolation level A eliminates phenomena X and Y but not Z. Such phenomena include dirty reads (read something that has been modified but later might be rolled back), phantom (read some data, read it again and a new row occurs in that set). Double read isn't among those phenomena defined in ANSI SQL, perhaps since the standard is too posh to bother about physical implementations (like indexes) and in such a world encountering the same row twice isn't even on the table? I.e., what you describe here is what we in the SQL Server community use to call something like "double read", but don't expect to see the terminology formalized for this phenomena.

(I realize that I probably over/misuse the word phenomena here, my bad in that case!)

Missing committed rows or encountering them multiple times *due to an [allocation-ordered scan][1] over changing data* is specific to using [read uncommitted isolation][2] (enabled by the `NOLOCK` hint here). See also [Previously committed rows might be missed if NOLOCK hint is used][3] by Microsoft's Lubor Kollar.

As [David Browne][4] suggested you could consider reporting off of a database snapshot - if you can’t use a an isolation level that produces correct results.


  [1]: https://sqlperformance.com/2015/01/t-sql-queries/allocation-order-scans
  [2]: https://sqlperformance.com/2015/04/t-sql-queries/the-read-uncommitted-isolation-level
  [3]: https://techcommunity.microsoft.com/t5/datacat/previously-committed-rows-might-be-missed-if-nolock-hint-is-used/ba-p/304944
  [4]: https://dba.stackexchange.com/users/126936/david-browne-microsoft

1 Answer