I have an example data file with the following contents, saved with UTF-8 encoding:

```
oab~opqr
öab~öpqr
öab~öpqr
```

The format of this file is fixed width, with columns 1 to 3 each allocated 1 character and column 4 allocated 5 characters.

I have created an XML format file as below:

```xml
<?xml version = "1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD xsi:type="CharFixed" ID="Col1" LENGTH="1"/>
    <FIELD xsi:type="CharFixed" ID="Col2" LENGTH="1"/>
    <FIELD xsi:type="CharFixed" ID="Col3" LENGTH="1"/>
    <FIELD xsi:type="CharFixed" ID="Col4" LENGTH="5"/>
    <FIELD xsi:type="CharTerm" ID="LINE_BREAK" TERMINATOR="\n"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="Col1" NAME="Col1" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="Col2" NAME="Col2" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="Col3" NAME="Col3" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="Col4" NAME="Col4" xsi:type="SQLNVARCHAR"/>
  </ROW>
</BCPFORMAT>
```

Disappointingly, running the following SQL...

```sql
SELECT *
FROM OPENROWSET
(
    BULK 'mydata.txt',
    FORMATFILE = 'myformat_file.xml',
    CODEPAGE = '65001'
) AS X
```

...produces the following results:

```
Col1 Col2 Col3 Col4
---- ---- ---- -----
o    a    b    ~opqr
�    �    a    b~öp
�    �    a    b~öp
```

from which I conclude the `LENGTH` is counting bytes rather than characters.

Is there any way I can get this working correctly for fixed *character* widths with UTF-8 encoding? (The target environment is Azure SQL Database reading from Blob storage.)

NB: It was suggested in the comments that adding `COLLATION="LATIN1_GENERAL_100_CI_AS_SC_UTF8"` to the `FIELD` elements might help, but the results remain unchanged with this.
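For reference, the byte interpretation is consistent with how the second line encodes. A quick check, as a sketch that assumes the `Latin1_General_100_CI_AS_SC_UTF8` collation is available to cast with:

```sql
-- View the second data line as raw UTF-8 bytes.
-- ö encodes as 0xC3 0xB6, so the line is 10 bytes rather than 8,
-- and slicing it as 1 + 1 + 1 + 5 *bytes* gives C3 | B6 | 61 | 62 7E C3 B6 70,
-- i.e. �, �, a, b~öp, matching the output above.
SELECT CONVERT(varbinary(20),
               CONVERT(varchar(20), N'öab~öpqr' COLLATE Latin1_General_100_CI_AS_SC_UTF8)
       ) AS utf8_bytes;   -- 0xC3B661627EC3B6707172
```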
One workaround is to just change the format file to bring the whole line in as a single value and do the substring-ing in T-SQL.

With the format file

```xml
<?xml version = "1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD xsi:type="CharTerm" ID="WholeLine" TERMINATOR="\n"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="WholeLine" NAME="WholeLine" xsi:type="SQLNVARCHAR"/>
  </ROW>
</BCPFORMAT>
```

the following does return the desired results:

```sql
SELECT SUBSTRING(WholeLine, 1, 1) AS Col1,
       SUBSTRING(WholeLine, 2, 1) AS Col2,
       SUBSTRING(WholeLine, 3, 1) AS Col3,
       SUBSTRING(WholeLine, 4, 5) AS Col4
FROM OPENROWSET
(
    BULK 'mydata.txt',
    FORMATFILE = 'myformat_file.xml',
    CODEPAGE = '65001'
) AS X
```
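If the parsed values need to be persisted, the same query can feed an insert. A minimal sketch, assuming a hypothetical destination table `dbo.FixedWidthData` with four string columns:

```sql
-- dbo.FixedWidthData is a hypothetical target table; adjust names and types to your schema.
INSERT INTO dbo.FixedWidthData (Col1, Col2, Col3, Col4)
SELECT SUBSTRING(WholeLine, 1, 1),
       SUBSTRING(WholeLine, 2, 1),
       SUBSTRING(WholeLine, 3, 1),
       SUBSTRING(WholeLine, 4, 5)
FROM OPENROWSET
(
    BULK 'mydata.txt',
    FORMATFILE = 'myformat_file.xml',
    CODEPAGE = '65001'
) AS X;
```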
> from which I conclude the `LENGTH` is counting bytes rather than characters.

This is correct, and there is no way to make it count characters instead. The situation is analogous to the *n* in char(*n*), varchar(*n*), nchar(*n*), and nvarchar(*n*), where *n* denotes the number of **bytes**, not characters. See the [documentation][1]:

> A common misconception is to think that with CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n), the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored. This is similar to the definition of NCHAR(n) and NVARCHAR(n). The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n. However, for multi-byte encoding such as UTF-8, higher Unicode ranges (128-1,114,111) result in one character using two or more bytes.

This is a source of confusion for many, especially since the [introduction][2] of UTF-8 support. It was possible before with n(var)char and supplementary characters, but it was relatively rarely encountered, I would say.

It would be nice if SQL Server extended its support for *characters* rather than *bytes* in several areas in future (including OPENROWSET). In the meantime, [your workaround][3] is the one I would probably use too.

[1]: https://docs.microsoft.com/en-gb/sql/t-sql/data-types/char-and-varchar-transact-sql#remarks
[2]: https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-utf-8-support-for-sql-server/ba-p/734928
[3]: https://topanswers.xyz/databases?q=1945#a2172
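The byte-versus-character distinction is easy to see directly with `LEN` and `DATALENGTH`; a small sketch, again assuming the `Latin1_General_100_CI_AS_SC_UTF8` collation mentioned in the question is available:

```sql
-- LEN counts characters, DATALENGTH counts bytes; under a UTF-8 collation
-- the two differ as soon as a character needs more than one byte.
SELECT LEN(v)        AS character_count,  -- 8
       DATALENGTH(v) AS byte_count        -- 10: each ö takes two bytes in UTF-8
FROM (VALUES
    (CONVERT(varchar(20), N'öab~öpqr' COLLATE Latin1_General_100_CI_AS_SC_UTF8))
) AS t(v);
```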