I have an example data file with the following contents, saved with UTF-8 encoding:

```
oab~opqr
öab~öpqr
öab~öpqr
```

The format of this file is fixed width, with columns 1 to 3 each allocated 1 character and column 4 allocated 5 characters.

I have created an XML format file as below:

```xml
<?xml version = "1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD xsi:type="CharFixed" ID="Col1" LENGTH="1"/>
    <FIELD xsi:type="CharFixed" ID="Col2" LENGTH="1"/>
    <FIELD xsi:type="CharFixed" ID="Col3" LENGTH="1"/>
    <FIELD xsi:type="CharFixed" ID="Col4" LENGTH="5"/>
    <FIELD xsi:type="CharTerm" ID="LINE_BREAK" TERMINATOR="\n"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="Col1" NAME="Col1" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="Col2" NAME="Col2" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="Col3" NAME="Col3" xsi:type="SQLNVARCHAR"/>
    <COLUMN SOURCE="Col4" NAME="Col4" xsi:type="SQLNVARCHAR"/>
  </ROW>
</BCPFORMAT>
```

Disappointingly, running the following SQL...

```sql
SELECT *
FROM OPENROWSET
(
    BULK 'mydata.txt',
    FORMATFILE = 'myformat_file.xml',
    CODEPAGE = '65001'
) AS X
```

...produces the following results:

```
Col1 Col2 Col3 Col4
---- ---- ---- -----
o    a    b    ~opqr
�    �    a    b~öp
�    �    a    b~öp
```

from which I conclude the `LENGTH` is counting bytes rather than characters.

Is there any way I can get this working correctly for fixed *character* widths with UTF-8 encoding? (The target environment is Azure SQL Database reading from Blob storage.)

NB: It was suggested in the comments that adding `COLLATION="LATIN1_GENERAL_100_CI_AS_SC_UTF8"` to the `FIELD` elements might help, but the results remain unchanged with this.
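For reference, the byte interpretation is consistent with how the second line encodes. A quick check, as a sketch that assumes the `Latin1_General_100_CI_AS_SC_UTF8` collation is available to cast with:

```sql
-- View the second data line as raw UTF-8 bytes.
-- ö encodes as 0xC3 0xB6, so the line is 10 bytes rather than 8,
-- and slicing it as 1 + 1 + 1 + 5 *bytes* gives C3 | B6 | 61 | 62 7E C3 B6 70,
-- i.e. �, �, a, b~öp, matching the output above.
SELECT CONVERT(varbinary(20),
               CONVERT(varchar(20), N'öab~öpqr' COLLATE Latin1_General_100_CI_AS_SC_UTF8)
       ) AS utf8_bytes;   -- 0xC3B661627EC3B6707172
```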
One workaround is to just change the format file to bring the whole line in as a single value and do the substring-ing in T-SQL.

With the format file

```xml
<?xml version = "1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD xsi:type="CharTerm" ID="WholeLine" TERMINATOR="\n"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="WholeLine" NAME="WholeLine" xsi:type="SQLNVARCHAR"/>
  </ROW>
</BCPFORMAT>
```

the following does return the desired results:

```sql
SELECT SUBSTRING(WholeLine, 1, 1) AS Col1,
       SUBSTRING(WholeLine, 2, 1) AS Col2,
       SUBSTRING(WholeLine, 3, 1) AS Col3,
       SUBSTRING(WholeLine, 4, 5) AS Col4
FROM OPENROWSET
(
    BULK 'mydata.txt',
    FORMATFILE = 'myformat_file.xml',
    CODEPAGE = '65001'
) AS X
```
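If the parsed values need to be persisted, the same query can feed an insert. A minimal sketch, assuming a hypothetical destination table `dbo.FixedWidthData` with four string columns:

```sql
-- dbo.FixedWidthData is a hypothetical target table; adjust names and types to your schema.
INSERT INTO dbo.FixedWidthData (Col1, Col2, Col3, Col4)
SELECT SUBSTRING(WholeLine, 1, 1),
       SUBSTRING(WholeLine, 2, 1),
       SUBSTRING(WholeLine, 3, 1),
       SUBSTRING(WholeLine, 4, 5)
FROM OPENROWSET
(
    BULK 'mydata.txt',
    FORMATFILE = 'myformat_file.xml',
    CODEPAGE = '65001'
) AS X;
```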
> from which I conclude the `LENGTH` is counting bytes rather than characters.

This is correct, and there is no way to make it count characters instead. The situation is analogous to the *n* in char(*n*), varchar(*n*), nchar(*n*), and nvarchar(*n*), where *n* denotes the number of **bytes**, not characters. See the [documentation][1]:

> A common misconception is to think that with CHAR(n) and VARCHAR(n), the n defines the number of characters. But in CHAR(n) and VARCHAR(n), the n defines the string length in bytes (0-8,000). n never defines numbers of characters that can be stored. This is similar to the definition of NCHAR(n) and NVARCHAR(n). The misconception happens because when using single-byte encoding, the storage size of CHAR and VARCHAR is n bytes and the number of characters is also n. However, for multi-byte encoding such as UTF-8, higher Unicode ranges (128-1,114,111) result in one character using two or more bytes.

This is a source of confusion for many, especially since the [introduction][2] of UTF-8 support. It was possible before with n(var)char and supplementary characters, but it was relatively rarely encountered, I would say.

It would be nice if SQL Server extended its support for *characters* rather than *bytes* in several areas in future (including OPENROWSET). In the meantime, [your workaround][3] is the one I would probably use too.

[1]: https://docs.microsoft.com/en-gb/sql/t-sql/data-types/char-and-varchar-transact-sql#remarks
[2]: https://techcommunity.microsoft.com/t5/sql-server-blog/introducing-utf-8-support-for-sql-server/ba-p/734928
[3]: https://topanswers.xyz/databases?q=1945#a2172
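The byte-versus-character distinction is easy to see directly with `LEN` and `DATALENGTH`; a small sketch, again assuming the `Latin1_General_100_CI_AS_SC_UTF8` collation mentioned in the question is available:

```sql
-- LEN counts characters, DATALENGTH counts bytes; under a UTF-8 collation
-- the two differ as soon as a character needs more than one byte.
SELECT LEN(v)        AS character_count,  -- 8
       DATALENGTH(v) AS byte_count        -- 10: each ö takes two bytes in UTF-8
FROM (VALUES
    (CONVERT(varchar(20), N'öab~öpqr' COLLATE Latin1_General_100_CI_AS_SC_UTF8))
) AS t(v);
```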