Why is SQL doing an index scan and seek with the same XML query structure?

sql-server add tag

Middletone (imported from SE)

I've set up a test to benchmark xml performance in SQL server.  

**Test Setup**
- One Million rows of data
- XML defined Column with primary key
- A primary XML index
- A secondary XML index on the path
- XML data is similar in format but with variable tag names in each document

**Table and Index Design**

    CREATE TABLE [dbo].[xml_Test]
    (
    	[ID] [int] IDENTITY(1,1) NOT NULL,
    	[GUID] [varchar](50) NULL,
    	[JSON_Data] [varchar](max) NULL,
    	[XML_Data] [xml] NULL,
    	CONSTRAINT [PK_xml_Test] PRIMARY KEY CLUSTERED ([ID] ASC)
    );
    
    ALTER TABLE [dbo].[xml_Test] ADD  CONSTRAINT [DF_xml_Test_GUID]  DEFAULT (newid()) FOR [GUID];
    ALTER TABLE [dbo].[xml_Test] ADD  CONSTRAINT [PK_xml_Test] PRIMARY KEY CLUSTERED ([ID] ASC);
    
    CREATE PRIMARY XML INDEX [PK_xml] ON [dbo].[xml_Test]
    (	[XML_Data]);
    
    CREATE XML INDEX [IX_xml_Path] ON [dbo].[xml_Test] 
    (	[XML_Data]) 
    USING XML INDEX [PK_xml] FOR PATH;

**Sample XML Schema**

    <data>
      <id>3812</id>
      <guid>E3735046-1183-4A79-B8EE-806312B533D6",</guid>
      <firstName>John</firstName>
      <lastName>Doe</lastName>
      <tel>123-123-1234</tel>
      <city>Toronto</city>
      <prov>Ontario</prov>
      <Q.49.R.47>14325</Q.49.R.47>
      <Q.1>14326</Q.1>
      <Q.9>143257</Q.9>
      <Q.25>14328</Q.25>
      <Q.50>14329</Q.50>
      <Q.51>14330</Q.51>
      <Q.30>14331</Q.30>
      <Q.22>14332</Q.22>
      <Q.100>14333</Q.100>
      <Q.70.R.4>1</Q.70.R.4>
      <Q.43>14335</Q.43>
      <Q.3>14336</Q.3>
      <Q.84.R.21.L.19>1</Q.84.R.21.L.19>
      <done>1</done>
    </data>

When I go to query the table with two different values, I get a very efficient seek for one result and a scan for the other that takes orders of magnitude longer to complete.

I'm unsure as to why the second query is not able to do a seek on the index.  I think it may have something to do with the element not being in the index to start with but even then I don't know why a scan would be necessary unless the index really didn't index all paths.  

[Execution Plans via Paste the Plan][1]

**Query A**

    select ID, 'Query A'
    from xml_test
    where XML_Data.exist('(/data/Q.70.R.4)[1]')=1

[![enter image description here][2]][2]

**Query B**

    select id, 'Query B'
    from xml_test
    where XML_Data.exist('(/data/Q.61.R.15)[1]')=1 

[![enter image description here][3]][3]

As you can see, Query A takes 5ms to execute compared to almost 15s for query B.  

What are the reasons that the same query format but with different input would yield such an obviously inefficient execution plan?

**Aside**

- I was also using this to test against JSON with identically formatted JSON data and my results show that the identical JSON query as XML Query A took 22,344 ms (22 seconds!) of cpu time (3,419 ms clock time) to run.  
- If the XML index can work reliably with a seek operation, it is clearly the more performant read option contrary to what I have researched and read online.
- When querying over a large amount of data, I would not recommend JSON at all inside SQL server other than for blob storage or where you have a small dataset, which is to say, don't use it in production for searchable data that should be indexed.  There are other good services designed for that.


  [1]: https://www.brentozar.com/PasteThePlan/?id=rkbjYn0Vd
  [2]: https://i.stack.imgur.com/NZ8Yj.png
  [3]: https://i.stack.imgur.com/nIKHj.png

Top Answer

Mikael Eriksson (imported from SE)

XML index in SQL Server is implemented as an internal table that is a persisted version of the node table that is much the same as the XML shredding functions produce.

One of the columns in the internal table is called hid and that column contains the value used in the seek for a path expression. The value is an [ordpath](https://www.cs.umb.edu/~poneil/ordpath.pdf) value. When you create a path xml index you get a non clustered index with hid as the leading column.

The ordpath for a specific element name is generated when SQL Server inserts XML data to the table. To encode the path, the different ordpath values are concatenated.

Simplified the path X/Y/Z could be represented as 1.2.3 and Y/Z/Z as 2.3.3.

SQL Server has to have a way to generate the same ordpath value for the same element name. That could be done with some kind of dictionary or table that keeps track of all generated ordpath values for one XML index. I'm a bit fuzzy here because I have not found anything anywhere that says this is how it is done and I have not found where it is stored.

Inserting one row with this XML

    <X>
      <Y>
        <Z />
      </Y>
    </X>

will give you this content in the internal table with one extra column added by me to show the path expression encoded in hid

    id      hid     pk1
    0x              1    
    0x58    À€      1     Z
    0x5AC0  Á€À€    1     Y/Z
    0x5AD6  Â€Á€À€  1     X/Y/Z

You can look at the internal table using a DAC login.

What happens when you query for a path expression (and here I get a bit speculative again) is that the same dictionary/lookup table that is used to make sure the elements with the same name get the same ordpath value is used to find the hid to use for the index seek in the internal table. The path expression in a XQuery is always static so the hid value used for the query can be stored in the query plan and resused.

A query for a node that exists get a query with a seek on the hid where `Î` is the hid value to search for.


    <SeekKeys>
      <Prefix ScanType="EQ">
        <RangeColumns>
          <ColumnReference Database="[yy]" Schema="[sys]" Table="[xml_index_nodes_658101385_256000]" 
    					   Alias="[NodeThatExist:1]" Column="hid" />
        </RangeColumns>
        <RangeExpressions>
          <ScalarOperator ScalarString="'Î'">
            <Const ConstValue="'Î'" />
          </ScalarOperator>
        </RangeExpressions>
      </Prefix>
    </SeekKeys>


A query for a node that does not exist gives you a very different way of locating the rows.

    select count(*)
    from dbo.T
    where T.X.exist('NodeDoesNotExist') = 1

[![enter image description here][1]][1]

The compute scalar uses some functions to calculate values that are checked against in the filter operator.

[![enter image description here][2]][2]


Some of the tests I did to get here can be found [here](https://pastebin.com/0aLFZi1Y). Note that you don't actually need a generated hid for a element in the index to get the seek plan, you just have to try to insert it.

As a side note, the rewriting of the query to use one way or the other to find the rows of a XML query is done before the query optimizer starts to do its job so if you get a seek or not also depends on cardinality on the hid column in the internal table.

__________

Paul White mentioned something in chat about where the element names inserted are stored:

> QNames are stored in sys.sysqnames. The XML algebrizer requires an
> existing known QName to generate an optimized path transform. i.e. to
> use a path index. Entries are added to sys.sysqnames in a system
> transaction that does not feel user transactions


  [1]: https://i.stack.imgur.com/wZI67.png
  [2]: https://i.stack.imgur.com/7Aicv.png

Answer #2

Paul White (imported from SE)

Consider storing your xml in a more usual format if you can. This might require a change at an earlier stage of the process, or some pre-processing when you import the data, but it could well be worth it.

The key observation is that encoding information in element names is quite unusual. Using xml with a predictable structure (ideally conforming to a schema) containing variable data is normally a much better idea.

To illustrate, I have restructured the sample data to a more general *item* and *value* organization below:

```
CREATE TABLE [dbo].[xml_Test]
(
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [GUID] [varchar](50) NULL DEFAULT NEWID(),
    [JSON_Data] [varchar](max) NULL,
    [XML_Data] [xml] NULL,
    CONSTRAINT [PK_xml_Test] PRIMARY KEY CLUSTERED ([ID] ASC)
);

INSERT TOP (10000) 
    dbo.xml_Test (XML_Data)
SELECT
    CONVERT(xml,
        N'
        <data>
            <id>3812</id>
            <guid>E3735046-1183-4A79-B8EE-806312B533D6</guid>
            <firstName>John</firstName>
            <lastName>Doe</lastName>
            <tel>123-123-1234</tel>
            <city>Toronto</city>
            <prov>Ontario</prov>
            <qitems>
                <item><name>Q.49.R.47</name><value>14325</value></item>
                <item><name>Q.1</name><value>14326</value></item>
                <item><name>Q.9</name><value>14357</value></item>
                <item><name>Q.25</name><value>14328</value></item>
                <item><name>Q.50</name><value>14329</value></item>
                <item><name>Q.51</name><value>14330</value></item>
                <item><name>Q.30</name><value>14331</value></item>
                <item><name>Q.22</name><value>14332</value></item>
                <item><name>Q.100</name><value>14333</value></item>
                <item><name>Q.70.R.4.1</name><value>1</value></item>
                <item><name>Q.43</name><value>14335</value></item>
                <item><name>Q.3</name><value>14336</value></item>
                <item><name>Q.84.R.21.L.19</name><value>1</value></item>
            </qitems>
            <done>1</done>
        </data>
        ')
FROM sys.all_columns AS AC1
CROSS JOIN sys.all_columns AS AC2;
```

I happened to use an element-centric design, but you could equally use *name* and *value* attributes on the *item* element if you prefer.

Adding one xml instance with a different Q name (Q.999):

```
INSERT dbo.xml_Test (XML_Data)
SELECT
    CONVERT(xml,
        N'
        <data>
            <id>3812</id>
            <guid>E3735046-1183-4A79-B8EE-806312B533D6</guid>
            <firstName>John</firstName>
            <lastName>Doe</lastName>
            <tel>123-123-1234</tel>
            <city>Toronto</city>
            <prov>Ontario</prov>
            <qitems>
                <item><name>Q.999</name><value>-999</value></item>
            </qitems>
            <done>1</done>
        </data>
        ');
```

The reorganized structure is suitable for a [selective xml index][1]:

```
CREATE SELECTIVE XML INDEX sxi
ON dbo.xml_Test (XML_Data)
FOR  
(  
    qitem_path = '/data/qitems/item' AS XQUERY 'node()',
    qitem_name =  '/data/qitems/item/name' AS XQUERY 'xs:string' MAXLENGTH(20) SINGLETON
    --qitem_value = '/data/qitems/item/value' AS SQL integer SINGLETON,
);

-- Optional secondary index for name
CREATE XML INDEX sxi_qitem_name
ON dbo.xml_Test (XML_Data)
USING XML INDEX sxi
FOR  (qitem_name);

```

Uncomment the value path if you want the index to be useful to extract *values*.

Locating the Q.999 name is now extremely efficient:

```
SELECT
    --XT.XML_Data.value('(./data/qitems/item/value)[1]', 'integer'),
    XT.ID
FROM dbo.xml_Test AS XT
WHERE 
    XT.XML_Data.exist('./data/qitems/item[name="Q.999"]') = 1;
```

[![enter image description here][2]][2]

Searching for a non-existent name (Q.61.R.15) is equally efficient:

```
SELECT
    --XT.XML_Data.value('(./data/qitems/item/value)[1]', 'integer'),
    XT.ID
FROM dbo.xml_Test AS XT
WHERE 
    XT.XML_Data.exist('./data/qitems/item[name="Q.61.R.15"]') = 1;
```

[![enter image description here][3]][3]

You can avoid one of the joins there by making the xml column `NOT NULL`, if that is appropriate.

  [1]: https://docs.microsoft.com/en-us/sql/t-sql/statements/create-selective-xml-index-transact-sql
  [2]: https://i.stack.imgur.com/J9NBy.png
  [3]: https://i.stack.imgur.com/o6hN8.png

2 Answers