My server's default collation is `Latin1_General_CI_AS`, as determined by this query: SELECT SERVERPROPERTY('Collation') AS Collation; I was surprised to discover that with this collation I can match non-digit characters in strings using the predicate `LIKE '[0-9]'`. Why in the default collation does this happen? I can't think of a case where this would be useful. I know I can work around the behavior using a binary collation, but it seems like a strange way to implement the default collation. ## Filtering digits produces non-digit caracters I can demonstrate the behavior by creating a column that contains all possible single-byte character values and filtering the values with the digit-matching predicate. The following statement creates a temporary table with 256 rows, one for each code point in the current code page: WITH P0(_) AS (SELECT 0 UNION ALL SELECT 0), P1(_) AS (SELECT 0 FROM P0 AS L CROSS JOIN P0 AS R), P2(_) AS (SELECT 0 FROM P1 AS L CROSS JOIN P1 AS R), P3(_) AS (SELECT 0 FROM P2 AS L CROSS JOIN P2 AS R), Tally(Number) AS ( SELECT -1 + ROW_NUMBER() OVER (ORDER BY (SELECT 0)) FROM P3 ) SELECT Number AS CodePoint, CHAR(Number) AS Symbol INTO #CodePage FROM Tally WHERE Number >= 0 AND Number <= 255; Each row contains the integer value of the code point, and the character value of the code point. Not all of the character values are displayable - some of the code points are strictly control characters. Here is a selective sample of the output of `SELECT CodePoint, Symbol FROM #CodePage`: 0 1 2 ... 32 33 ! 34 " 35 # ... 48 0 49 1 50 2 ... 65 A 66 B 67 C ... 253 ý 254 þ 255 ÿ I would expect to be able to filter on the Symbol column to find digit characters using a `LIKE` predicate and specifying the range of characters '0' thru '9': SELECT CodePoint, Symbol FROM #CodePage WHERE Symbol LIKE '[0-9]'; It produces a surprising output: CodePoint Symbol 48 0 49 1 50 2 51 3 52 4 53 5 54 6 55 7 56 8 57 9 178 ² 179 ³ 185 ¹ 188 ¼ 189 ½ 190 ¾ The set of code points 48 thru 57 are the ones I expect. What surprises me is that the symbols for superscripts and fractions are also included in the result set! There might be a mathematical reason to think of exponents and fractions as numbers, but it seems wrong to call them digits. ## Using binary collation as a workaround I understand that to get the result I expect, I can force the corresponding binary collation `Latin1_General_BIN`: SELECT CodePoint, Symbol FROM #CodePage WHERE Symbol LIKE '[0-9]' COLLATE Latin1_General_BIN; The result set includes only the code points 48 thru 57: CodePoint Symbol 48 0 49 1 50 2 51 3 52 4 53 5 54 6 55 7 56 8 57 9
`[0-9]` is not some type of regular expression defined to just match digits. Any range in a `LIKE` pattern matches characters between the start and end character according to collation sort order. SELECT CodePoint, Symbol, RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk FROM #CodePage WHERE Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS ORDER BY Symbol COLLATE Latin1_General_CI_AS Returns CodePoint Symbol Rnk -------------------- ------ -------------------- 48 0 1 188 ¼ 2 189 ½ 3 190 ¾ 4 185 ¹ 5 49 1 5 50 2 7 178 ² 7 179 ³ 9 51 3 9 52 4 11 53 5 12 54 6 13 55 7 14 56 8 15 57 9 16 So you get these results because under your default collation these characters sort after `0` but before `9`. It looks as though the collation is defined to actually sort them in mathematical order with the fractions in the correct order between `0` and `1`. You could also use a set rather than a range. To avoid `2` matching `²` you would need a `CS` collation SELECT CodePoint, Symbol FROM #CodePage WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_AS
Latin1 is code page 1252, in which [178 is 'SUPERSCRIPT TWO'][1]. This is an Unicode [superscript][2]: is **the character "2" as superscript**. According to the [Unicode Technical Standard #10][3] it should compare equal to 2, see [8.1 Collation Folding][4]: > Map compatibility (tertiary) equivalents, **such as full-width and > superscript characters**, to representative character(s) The bug would be if superscript 2 would compare different from 2! Before you say 'but my column is not Unicode', rest assured: according to [MSDN](http://msdn.microsoft.com/en-us/library/ms143726.aspx) (see Windows Collations) all string comparison and sorting are done according to the Unicode rules, even when the on-disk representation is CHAR. As for the other characters in your example, like `VULGAR FRACTION ONE QUARTER` and the like they do not compare equal to any number, but, as Mark already showed, they do sort properly between 0 and 9. And, of course, if you'd change the code page you would get different results. Eg. with `Greek_CS_AS` ([code page 1253][5]) you would get the characters with code 178, 179 and 189. [1]: http://www.microsoft.com/typography/unicode/1252.htm [2]: http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Superscripts_and_subscripts_block [3]: http://www.unicode.org/reports/tr10/ [4]: http://www.unicode.org/reports/tr10/#Collation_Folding [5]: http://www.microsoft.com/typography/unicode/1253.htm