sql-server add tag
Iain Samuel McLean Elder (imported from SE)
My server's default collation is `Latin1_General_CI_AS`, as determined by this query:

    SELECT SERVERPROPERTY('Collation') AS Collation;

I was surprised to discover that with this collation I can match non-digit characters in strings using the predicate `LIKE '[0-9]'`.

Why in the default collation does this happen? I can't think of a case where this would be useful. I know I can work around the behavior using a binary collation, but it seems like a strange way to implement the default collation.

## Filtering digits produces non-digit caracters

I can demonstrate the behavior by creating a column that contains all possible single-byte character values and filtering the values with the digit-matching predicate.

The following statement creates a temporary table with 256 rows, one for each code point in the current code page:

    WITH P0(_) AS (SELECT 0 UNION ALL SELECT 0),
    P1(_) AS (SELECT 0 FROM P0 AS L CROSS JOIN P0 AS R),
    P2(_) AS (SELECT 0 FROM P1 AS L CROSS JOIN P1 AS R),
    P3(_) AS (SELECT 0 FROM P2 AS L CROSS JOIN P2 AS R),
    Tally(Number) AS (
      SELECT -1 + ROW_NUMBER() OVER (ORDER BY (SELECT 0))
      FROM P3
    )
    SELECT Number AS CodePoint, CHAR(Number) AS Symbol
    INTO #CodePage
    FROM Tally
    WHERE Number >= 0 AND Number <= 255;

Each row contains the integer value of the code point, and the character value of the code point. Not all of the character values are displayable - some of the code points are strictly control characters. Here is a selective sample of the output of `SELECT CodePoint, Symbol FROM #CodePage`:

    0	
    1	
    2	
    ...
    32	 
    33	!
    34	"
    35	#
    ...
    48	0
    49	1
    50	2
    ...
    65	A
    66	B
    67	C
    ...
    253	ý
    254	þ
    255	ÿ

I would expect to be able to filter on the Symbol column to find digit characters using a `LIKE` predicate and specifying the range of characters '0' thru '9':

    SELECT CodePoint, Symbol
    FROM #CodePage
    WHERE Symbol LIKE '[0-9]';

It produces a surprising output:

    CodePoint	Symbol
    48	0
    49	1
    50	2
    51	3
    52	4
    53	5
    54	6
    55	7
    56	8
    57	9
    178	²
    179	³
    185	¹
    188	¼
    189	½
    190	¾

The set of code points 48 thru 57 are the ones I expect. What surprises me is that the symbols for superscripts and fractions are also included in the result set!

There might be a mathematical reason to think of exponents and fractions as numbers, but it seems wrong to call them digits.

## Using binary collation as a workaround

I understand that to get the result I expect, I can force the corresponding binary collation `Latin1_General_BIN`:

    SELECT CodePoint, Symbol
    FROM #CodePage
    WHERE Symbol LIKE '[0-9]' COLLATE Latin1_General_BIN;

The result set includes only the code points 48 thru 57:

    CodePoint	Symbol
    48	0
    49	1
    50	2
    51	3
    52	4
    53	5
    54	6
    55	7
    56	8
    57	9
Top Answer
Martin Smith (imported from SE)
`[0-9]` is not some type of regular expression defined to just match digits. 

Any range in a `LIKE` pattern matches characters between the start and end character according to collation sort order.

    SELECT CodePoint,
           Symbol,
           RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk
    FROM   #CodePage
    WHERE  Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS
    ORDER  BY Symbol COLLATE Latin1_General_CI_AS 

Returns

    CodePoint            Symbol Rnk
    -------------------- ------ --------------------
    48                   0      1
    188                  ¼      2
    189                  ½      3
    190                  ¾      4
    185                  ¹      5
    49                   1      5
    50                   2      7
    178                  ²      7
    179                  ³      9
    51                   3      9
    52                   4      11
    53                   5      12
    54                   6      13
    55                   7      14
    56                   8      15
    57                   9      16

So you get these results because under your default collation these characters sort after `0` but before `9`. 

It looks as though the collation is defined to actually sort them in mathematical order with the fractions in the correct order between `0` and `1`.

You could also use a set rather than a range. To avoid `2` matching `²` you would need a `CS` collation

    SELECT CodePoint, Symbol
    FROM #CodePage
    WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_AS
Answer #2
Remus Rusanu (imported from SE)
Latin1 is code page 1252, in which [178 is 'SUPERSCRIPT TWO'][1]. This is an Unicode [superscript][2]: is **the character "2" as superscript**. According to the [Unicode Technical Standard #10][3] it should compare equal to 2, see [8.1 Collation Folding][4]:

> Map compatibility (tertiary) equivalents, **such as full-width and
> superscript characters**, to representative character(s)

The bug would be if superscript 2 would compare different from 2! Before you say 'but my column is not Unicode', rest assured: according to [MSDN](http://msdn.microsoft.com/en-us/library/ms143726.aspx) (see Windows Collations) all string comparison and sorting are done according to the Unicode rules, even when the on-disk representation is CHAR.

As for the other characters in your example, like `VULGAR FRACTION ONE QUARTER` and the like they do not compare equal to any number, but, as Mark already showed, they do sort properly between 0 and 9.

And, of course, if you'd change the code page you would get different results. Eg. with `Greek_CS_AS` ([code page 1253][5]) you would get the characters with code 178, 179 and 189. 


  [1]: http://www.microsoft.com/typography/unicode/1252.htm
  [2]: http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Superscripts_and_subscripts_block
  [3]: http://www.unicode.org/reports/tr10/
  [4]: http://www.unicode.org/reports/tr10/#Collation_Folding
  [5]: http://www.microsoft.com/typography/unicode/1253.htm

This room is for discussion about this question.

Once logged in you can direct comments to any contributor here.

Enter question or answer id or url (and optionally further answer ids/urls from the same question) from

Separate each id/url with a space. No need to list your own answers; they will be imported automatically.