My server's default collation is `Latin1_General_CI_AS`, as determined by this query:
SELECT SERVERPROPERTY('Collation') AS Collation;
I was surprised to discover that with this collation I can match non-digit characters in strings using the predicate `LIKE '[0-9]'`.
Why in the default collation does this happen? I can't think of a case where this would be useful. I know I can work around the behavior using a binary collation, but it seems like a strange way to implement the default collation.
## Filtering digits produces non-digit caracters
I can demonstrate the behavior by creating a column that contains all possible single-byte character values and filtering the values with the digit-matching predicate.
The following statement creates a temporary table with 256 rows, one for each code point in the current code page:
WITH P0(_) AS (SELECT 0 UNION ALL SELECT 0),
P1(_) AS (SELECT 0 FROM P0 AS L CROSS JOIN P0 AS R),
P2(_) AS (SELECT 0 FROM P1 AS L CROSS JOIN P1 AS R),
P3(_) AS (SELECT 0 FROM P2 AS L CROSS JOIN P2 AS R),
Tally(Number) AS (
SELECT -1 + ROW_NUMBER() OVER (ORDER BY (SELECT 0))
FROM P3
)
SELECT Number AS CodePoint, CHAR(Number) AS Symbol
INTO #CodePage
FROM Tally
WHERE Number >= 0 AND Number <= 255;
Each row contains the integer value of the code point, and the character value of the code point. Not all of the character values are displayable - some of the code points are strictly control characters. Here is a selective sample of the output of `SELECT CodePoint, Symbol FROM #CodePage`:
0
1
2
...
32
33 !
34 "
35 #
...
48 0
49 1
50 2
...
65 A
66 B
67 C
...
253 ý
254 þ
255 ÿ
I would expect to be able to filter on the Symbol column to find digit characters using a `LIKE` predicate and specifying the range of characters '0' thru '9':
SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0-9]';
It produces a surprising output:
CodePoint Symbol
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55 7
56 8
57 9
178 ²
179 ³
185 ¹
188 ¼
189 ½
190 ¾
The set of code points 48 thru 57 are the ones I expect. What surprises me is that the symbols for superscripts and fractions are also included in the result set!
There might be a mathematical reason to think of exponents and fractions as numbers, but it seems wrong to call them digits.
## Using binary collation as a workaround
I understand that to get the result I expect, I can force the corresponding binary collation `Latin1_General_BIN`:
SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0-9]' COLLATE Latin1_General_BIN;
The result set includes only the code points 48 thru 57:
CodePoint Symbol
48 0
49 1
50 2
51 3
52 4
53 5
54 6
55 7
56 8
57 9
`[0-9]` is not some type of regular expression defined to just match digits.
Any range in a `LIKE` pattern matches characters between the start and end character according to collation sort order.
SELECT CodePoint,
Symbol,
RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk
FROM #CodePage
WHERE Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS
ORDER BY Symbol COLLATE Latin1_General_CI_AS
Returns
CodePoint Symbol Rnk
-------------------- ------ --------------------
48 0 1
188 ¼ 2
189 ½ 3
190 ¾ 4
185 ¹ 5
49 1 5
50 2 7
178 ² 7
179 ³ 9
51 3 9
52 4 11
53 5 12
54 6 13
55 7 14
56 8 15
57 9 16
So you get these results because under your default collation these characters sort after `0` but before `9`.
It looks as though the collation is defined to actually sort them in mathematical order with the fractions in the correct order between `0` and `1`.
You could also use a set rather than a range. To avoid `2` matching `²` you would need a `CS` collation
SELECT CodePoint, Symbol
FROM #CodePage
WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_AS
Latin1 is code page 1252, in which [178 is 'SUPERSCRIPT TWO'][1]. This is an Unicode [superscript][2]: is **the character "2" as superscript**. According to the [Unicode Technical Standard #10][3] it should compare equal to 2, see [8.1 Collation Folding][4]:
> Map compatibility (tertiary) equivalents, **such as full-width and
> superscript characters**, to representative character(s)
The bug would be if superscript 2 would compare different from 2! Before you say 'but my column is not Unicode', rest assured: according to [MSDN](http://msdn.microsoft.com/en-us/library/ms143726.aspx) (see Windows Collations) all string comparison and sorting are done according to the Unicode rules, even when the on-disk representation is CHAR.
As for the other characters in your example, like `VULGAR FRACTION ONE QUARTER` and the like they do not compare equal to any number, but, as Mark already showed, they do sort properly between 0 and 9.
And, of course, if you'd change the code page you would get different results. Eg. with `Greek_CS_AS` ([code page 1253][5]) you would get the characters with code 178, 179 and 189.
[1]: http://www.microsoft.com/typography/unicode/1252.htm
[2]: http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Superscripts_and_subscripts_block
[3]: http://www.unicode.org/reports/tr10/
[4]: http://www.unicode.org/reports/tr10/#Collation_Folding
[5]: http://www.microsoft.com/typography/unicode/1253.htm