Why are non-digits LIKE [0-9]?

sql-server add tag

Iain Samuel McLean Elder (imported from SE)

My server's default collation is `Latin1_General_CI_AS`, as determined by this query:

    SELECT SERVERPROPERTY('Collation') AS Collation;

I was surprised to discover that with this collation I can match non-digit characters in strings using the predicate `LIKE '[0-9]'`.

Why in the default collation does this happen? I can't think of a case where this would be useful. I know I can work around the behavior using a binary collation, but it seems like a strange way to implement the default collation.

## Filtering digits produces non-digit caracters

I can demonstrate the behavior by creating a column that contains all possible single-byte character values and filtering the values with the digit-matching predicate.

The following statement creates a temporary table with 256 rows, one for each code point in the current code page:

    WITH P0(_) AS (SELECT 0 UNION ALL SELECT 0),
    P1(_) AS (SELECT 0 FROM P0 AS L CROSS JOIN P0 AS R),
    P2(_) AS (SELECT 0 FROM P1 AS L CROSS JOIN P1 AS R),
    P3(_) AS (SELECT 0 FROM P2 AS L CROSS JOIN P2 AS R),
    Tally(Number) AS (
      SELECT -1 + ROW_NUMBER() OVER (ORDER BY (SELECT 0))
      FROM P3
    )
    SELECT Number AS CodePoint, CHAR(Number) AS Symbol
    INTO #CodePage
    FROM Tally
    WHERE Number >= 0 AND Number <= 255;

Each row contains the integer value of the code point, and the character value of the code point. Not all of the character values are displayable - some of the code points are strictly control characters. Here is a selective sample of the output of `SELECT CodePoint, Symbol FROM #CodePage`:

    0	
    1	
    2	
    ...
    32	 
    33	!
    34	"
    35	#
    ...
    48	0
    49	1
    50	2
    ...
    65	A
    66	B
    67	C
    ...
    253	ý
    254	þ
    255	ÿ

I would expect to be able to filter on the Symbol column to find digit characters using a `LIKE` predicate and specifying the range of characters '0' thru '9':

    SELECT CodePoint, Symbol
    FROM #CodePage
    WHERE Symbol LIKE '[0-9]';

It produces a surprising output:

    CodePoint	Symbol
    48	0
    49	1
    50	2
    51	3
    52	4
    53	5
    54	6
    55	7
    56	8
    57	9
    178	²
    179	³
    185	¹
    188	¼
    189	½
    190	¾

The set of code points 48 thru 57 are the ones I expect. What surprises me is that the symbols for superscripts and fractions are also included in the result set!

There might be a mathematical reason to think of exponents and fractions as numbers, but it seems wrong to call them digits.

## Using binary collation as a workaround

I understand that to get the result I expect, I can force the corresponding binary collation `Latin1_General_BIN`:

    SELECT CodePoint, Symbol
    FROM #CodePage
    WHERE Symbol LIKE '[0-9]' COLLATE Latin1_General_BIN;

The result set includes only the code points 48 thru 57:

    CodePoint	Symbol
    48	0
    49	1
    50	2
    51	3
    52	4
    53	5
    54	6
    55	7
    56	8
    57	9

Top Answer

Martin Smith (imported from SE)

`[0-9]` is not some type of regular expression defined to just match digits. 

Any range in a `LIKE` pattern matches characters between the start and end character according to collation sort order.

    SELECT CodePoint,
           Symbol,
           RANK() OVER (ORDER BY Symbol COLLATE Latin1_General_CI_AS) AS Rnk
    FROM   #CodePage
    WHERE  Symbol LIKE '[0-9]' COLLATE Latin1_General_CI_AS
    ORDER  BY Symbol COLLATE Latin1_General_CI_AS 

Returns

    CodePoint            Symbol Rnk
    -------------------- ------ --------------------
    48                   0      1
    188                  ¼      2
    189                  ½      3
    190                  ¾      4
    185                  ¹      5
    49                   1      5
    50                   2      7
    178                  ²      7
    179                  ³      9
    51                   3      9
    52                   4      11
    53                   5      12
    54                   6      13
    55                   7      14
    56                   8      15
    57                   9      16

So you get these results because under your default collation these characters sort after `0` but before `9`. 

It looks as though the collation is defined to actually sort them in mathematical order with the fractions in the correct order between `0` and `1`.

You could also use a set rather than a range. To avoid `2` matching `²` you would need a `CS` collation

    SELECT CodePoint, Symbol
    FROM #CodePage
    WHERE Symbol LIKE '[0123456789]' COLLATE Latin1_General_CS_AS

Answer #2

Remus Rusanu (imported from SE)

Latin1 is code page 1252, in which [178 is 'SUPERSCRIPT TWO'][1]. This is an Unicode [superscript][2]: is **the character "2" as superscript**. According to the [Unicode Technical Standard #10][3] it should compare equal to 2, see [8.1 Collation Folding][4]:

> Map compatibility (tertiary) equivalents, **such as full-width and
> superscript characters**, to representative character(s)

The bug would be if superscript 2 would compare different from 2! Before you say 'but my column is not Unicode', rest assured: according to [MSDN](http://msdn.microsoft.com/en-us/library/ms143726.aspx) (see Windows Collations) all string comparison and sorting are done according to the Unicode rules, even when the on-disk representation is CHAR.

As for the other characters in your example, like `VULGAR FRACTION ONE QUARTER` and the like they do not compare equal to any number, but, as Mark already showed, they do sort properly between 0 and 9.

And, of course, if you'd change the code page you would get different results. Eg. with `Greek_CS_AS` ([code page 1253][5]) you would get the characters with code 178, 179 and 189. 


  [1]: http://www.microsoft.com/typography/unicode/1252.htm
  [2]: http://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts#Superscripts_and_subscripts_block
  [3]: http://www.unicode.org/reports/tr10/
  [4]: http://www.unicode.org/reports/tr10/#Collation_Folding
  [5]: http://www.microsoft.com/typography/unicode/1253.htm

2 Answers