sql-server add tag
Martin Smith (imported from SE)
I have a table as below

    CREATE TABLE dbo.DemoTable
    (
    Value VARCHAR(12)
    )

I would like to constrain it to only contain rows where `Value` matches the following pattern

`[axyto0-9\s]{0,2}[\s0-9]{0,10}`

 - Start of string  
 - A single character present in the list below between 
   zero and 2 times, as many times as possible
  - A single character from the list “axyto” (case sensitive)
  - A character in the range between “0” and “9”
  - A “whitespace character” 
 - Match a single character present in the list below between zero and 10 times, as many times as possible
  - A “whitespace character” 
  - A character in the range between “0” and “9” 
 - End of string

How can I do this in SQL Server? ([inspired by this question on SO][1])


  [1]: https://stackoverflow.com/q/61580786
Top Answer
Martin Smith (imported from SE)
String handling in TSQL is poor.

`PATINDEX` supports a limited set of wild cards but has no support for character classes such as `\s` to match white space or quantifiers. It isn't impossible to do the validation for this relatively simple regex with a native TSQL expression. The max length of 12 is enforced by the datatype and use of `varchar` means there are only 6 white space characters in my code page.

So the following approach should work

 - First 2 characters must match `NOT LIKE CONCAT('%[^aotxy0123456789',char(9),char(10),char(11),char(12),char(13),char(32),char(160),']%')`
 - From 3rd character on must match `NOT LIKE CONCAT('%[^0123456789',char(9),char(10),char(11),char(12),char(13),char(32),char(160),']%')`   

This is already fairly messy (I'm not even certain that it is currently correct) and it becomes completely unviable for more complex patterns.

Regular Expression support is available by using CLR and is also called out as one of the possibilities enabled by [the Java language extension][1].

> This extends the TSQL surface area to better handle use cases
> involving regular expressions, string handling, and NLP support.

Examples of using regular expressions with these technologies are already available on the Microsoft Site ([CLR][2], [Java][3])

This answer gives a solution for the cases where these are not an option (perhaps as disabled by policy or using Azure SQL database where these are not available).

This leverages [XML Schema Regular Expressions][4].

## Create an XML Schema with a pattern constraint containing the regex

    CREATE XML SCHEMA COLLECTION dbo.PatternValidatorSchema
    AS '<?xml version="1.0" encoding="utf-8"?>
    <xs:schema attributeFormDefault="unqualified"
    		   elementFormDefault="qualified"
    		   xmlns:xs="http://www.w3.org/2001/XMLSchema">
    	<xs:element name="test">
    		<xs:simpleType>
    			<xs:restriction base="xs:string">
    				<xs:pattern value="[axyto0-9\s]{0,2}[\s0-9]{0,10}"/>
    			</xs:restriction>
    		</xs:simpleType>
    	</xs:element>
    </xs:schema>';

# Add a check constraint so inserts/updates are validated

    ALTER TABLE dbo.DemoTable 
     ADD CONSTRAINT CK_ValidateValueMatchesPattern 
     CHECK (CAST(ISNULL('<test>' + REPLACE(REPLACE(REPLACE(Value, '&','&amp;'), '<','&lt;'), '>','&gt;') + '</test>','') AS XML(dbo.PatternValidatorSchema)) IS NOT NULL)

This approach is not very flexible for general validation as any failure will throw an error so it can't be used to determine good and bad rows in a set based way, but for the purposes of aborting the insert if a single bad value is found it works fine.

The check constraint isn't really checking anything worthwhile. It is there to cause the conversion to typed XML to be invoked as any failures will cause an error to be thrown.

## Insert some rows 

    INSERT INTO dbo.DemoTable
    VALUES      (''),
                ('ax1234567890'),
                ('to123456'),
                ('AX1234567890'); 


Error

> Msg 6926, Level 16, State 1, Line 37 
> 
> XML Validation: Invalid simple type value: 'AX1234567890'. Location: /*:test\[1]

The final value attempted violated the expression as it is upper case so the insert was blocked. The offending value is helpfully shown in the error (though the rest of the message may cause confusion)


# Performance

Unfortunately there is a potentially significant performance penalty to pay though. When inserting 10 million valid values into the table with a `TABLOCK` hint on my machine it took about **2.6 seconds with no check constraint**.

Calling the below in a check constraint increased this to **12 seconds with the CLR function**

    using Microsoft.SqlServer.Server;
    using System.Data.SqlTypes;
    using System.Text.RegularExpressions;
    
    public partial class UserDefinedFunctions
    {
        private static readonly Regex regexObj = new Regex(@"\A[axyto0-9\s]{0,2}[\s0-9]{0,10}\z", RegexOptions.Compiled);
    
        [SqlFunction(DataAccess = DataAccessKind.None, 
                     IsDeterministic = true, 
                     IsPrecise = true, 
                     SystemDataAccess = SystemDataAccessKind.None)]
        public static SqlBoolean PatternValidator([SqlFacet(MaxSize = 12)]SqlString valueToCheck)
        {    
            return new SqlBoolean(regexObj.IsMatch(valueToCheck.Value));
        }
    }


With the below check constraint in place the **native TSQL attempt took 17 seconds**

    ALTER TABLE [dbo].[DemoTable]  WITH CHECK ADD  CONSTRAINT [CK_ValidateValueMatchesPattern] CHECK  (
    LEFT(Value,2)  COLLATE Latin1_General_100_BIN2 NOT LIKE CONCAT('%[^aotxy0123456789',char(9),char(10),char(11),char(12),char(13),char(32),char(160),']%')
          AND SUBSTRING(Value,3,10) COLLATE Latin1_General_100_BIN2 NOT LIKE CONCAT('%[^aotxy0123456789',char(9),char(10),char(11),char(12),char(13),char(32),char(160),']%')
    )

Using the **typed XML approach took 4 minutes**.

This is a penalty in the order of 24 microseconds per row vs no constraint so it would need to be evaluated whether this is a price worth paying or not taking into account expected volume of inserts/updates.  

[![enter image description here][5]][5]

Just by way of comparison after altering the check constraint to produce untyped XML it took 1 min 25 seconds to insert that quantity of rows so the typed XML does add some significant overhead on top of that.


  [1]: https://cloudblogs.microsoft.com/sqlserver/2020/04/22/open-sourcing-the-java-language-extension-for-sql-server/
  [2]: https://docs.microsoft.com/en-us/archive/msdn-magazine/2007/february/sql-server-regular-expressions-for-efficient-sql-querying
  [3]: https://docs.microsoft.com/en-us/sql/language-extensions/tutorials/search-for-string-using-regular-expressions-in-java?view=sql-server-ver15
  [4]: https://www.regular-expressions.info/xml.html
  [5]: https://i.stack.imgur.com/yMxON.png

This room is for discussion about this question.

Once logged in you can direct comments to any contributor here.

Enter question or answer id or url (and optionally further answer ids/urls from the same question) from

Separate each id/url with a space. No need to list your own answers; they will be imported automatically.