How to write UTF-8 characters with BULK INSERT in SQL Server?


I am doing a BULK INSERT into SQL Server and it is not inserting UTF-8 characters into the database properly. The data file contains these characters, but the database rows contain garbage characters after the bulk insert executes.

My first suspect was the last line of the format file:

10.0
3
1 SQLCHAR  0  0  "{|}"  1 INSTANCEID ""
2 SQLCHAR  0  0  "{|}"  2 PROPERTYID ""
3 SQLCHAR  0  0  "[|]"  3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"

But after reading this official page, it seems to me that this is actually a bug in how the insert operation reads the data file in SQL Server 2008. We are using version 2008 R2.

What is the solution to this problem or at least a workaround?


Comments (13)

孤云独去闲 2024-11-05 01:32:19


I came here while looking for a solution for bulk inserting special characters.
I didn't like the workaround with UTF-16 (it would double the size of the csv file).
I found out that you definitely CAN, and it's very easy: you don't need a format file.
This answer is for other people who are looking for the same thing, since it doesn't seem to be documented well anywhere, and I believe this is a very common issue for non-English speakers. The solution is:
just add CODEPAGE='65001' inside the WITH statement of the bulk insert (65001 = the code page number for UTF-8).
It might not work for all Unicode characters, as suggested by Michael O, but at least it works perfectly for Latin Extended, Greek and Cyrillic, and probably many others too.

Note: the MSDN documentation says UTF-8 is not supported; don't believe it. For me this works perfectly in SQL Server 2008, though I haven't tried other versions.

e.g.:

BULK INSERT #myTempTable 
FROM  'D:\somefolder\myCSV.txt'
WITH 
    ( 
        CODEPAGE = '65001',
        FIELDTERMINATOR = '|',
        ROWTERMINATOR ='\n'
    );

If all your special characters are in the range 160-255 (ISO-8859-1 or windows-1252), you could also use:

BULK INSERT #myTempTable 
FROM  'D:\somefolder\myCSV.txt'
WITH 
    ( 
        CODEPAGE = 'ACP',
        FIELDTERMINATOR = '|',
        ROWTERMINATOR ='\n'
    );
红颜悴 2024-11-05 01:32:19

You can't. You should first use an N-type data field, convert your file to UTF-16, and then import it. The database does not support UTF-8.
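
A minimal sketch of that approach, assuming the table and columns from the question's format file (names and path are hypothetical), after re-encoding the data file as UTF-16 LE:

CREATE TABLE #myTempTable (
    INSTANCEID  NVARCHAR(50),    -- N-type (Unicode) columns
    PROPERTYID  NVARCHAR(50),
    CONTENTTEXT NVARCHAR(MAX)
);

BULK INSERT #myTempTable
FROM 'D:\somefolder\myData_utf16.txt'  -- assumed path to the re-encoded file
WITH
    (
        DATAFILETYPE = 'widechar',     -- read the data file as UTF-16
        FIELDTERMINATOR = '|',
        ROWTERMINATOR = '\n'
    );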

就此别过 2024-11-05 01:32:19

  1. In Excel, save the file as CSV (comma delimited)
  2. Open the saved CSV file in Notepad++
  3. Encoding -> Convert to UCS-2 Big Endian
  4. Save
BULK INSERT #tmpData
FROM 'C:\Book2.csv'
WITH
(
    FIRSTROW = 2,
    FIELDTERMINATOR = ';',  --CSV field delimiter
    ROWTERMINATOR = '\n',   --Use to shift the control to next row
    TABLOCK
);

Done.
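
(Presumably this works even though the WITH clause does not specify DATAFILETYPE = 'widechar': when the data file starts with a Unicode byte-order mark, SQL Server detects it and assumes widechar for the load.)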

寂寞陪衬 2024-11-05 01:32:19


Microsoft just added UTF-8 support to SQL Server 2014 SP2:

https://support.microsoft.com/en-us/kb/3136780

蓝天白云 2024-11-05 01:32:19


You can re-encode the data file with UTF-16. That's what I did anyway.

九公里浅绿 2024-11-05 01:32:19


Use these options:
DATAFILETYPE = 'char' and CODEPAGE = '1252'
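
For example, in the WITH clause of a BULK INSERT (table, path and terminators borrowed from the earlier examples); note that code page 1252 handles Windows-1252 characters, not arbitrary UTF-8:

BULK INSERT #myTempTable
FROM 'D:\somefolder\myCSV.txt'   -- assumed path
WITH
    (
        DATAFILETYPE = 'char',   -- single-byte character data
        CODEPAGE = '1252',       -- Windows-1252 (Western European)
        FIELDTERMINATOR = '|',
        ROWTERMINATOR = '\n'
    );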

我只土不豪 2024-11-05 01:32:19


Note that as of Microsoft SQL Server 2016, UTF-8 is supported by bcp, BULK INSERT (which was part of the original question), and OPENROWSET.
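
For instance, on 2016 or later the OPENROWSET variant would look roughly like this (data file, format file and column names are assumptions reusing the question's setup):

SELECT INSTANCEID, PROPERTYID, CONTENTTEXT
FROM OPENROWSET(
        BULK 'D:\somefolder\myData_utf8.txt',    -- assumed UTF-8 data file
        FORMATFILE = 'D:\somefolder\myData.fmt', -- assumed format file
        CODEPAGE = '65001'                       -- UTF-8
     ) AS rows;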

聚集的泪 2024-11-05 01:32:19


Shouldn't you be using SQLNCHAR instead of SQLCHAR for the Unicode data?
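
For reference, the question's format file rewritten with SQLNCHAR would look something like this (a sketch; it assumes the data file itself has also been re-encoded as UTF-16/widechar, since SQLNCHAR describes the host file's on-disk format):

10.0
3
1 SQLNCHAR  0  0  "{|}"  1 INSTANCEID ""
2 SQLNCHAR  0  0  "{|}"  2 PROPERTYID ""
3 SQLNCHAR  0  0  "[|]"  3 CONTENTTEXT "SQL_Latin1_General_CP1_CI_AS"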

离线来电— 2024-11-05 01:32:19


Thought I would add my thoughts to this. We were trying to load data into SQL Server using bcp and had a lot of trouble.

bcp does not, in most versions, support any kind of UTF-8 file. We discovered that UTF-16 would work, but it is more complex than these posts make it look.

Using Java, we wrote the file with this code:

PrintStream fileStream = new PrintStream(NEW_TABLE_DATA_FOLDER + fileName, "x-UTF-16LE-BOM");

This gave us the correct data to insert.

That is, UTF-16 little-endian with a byte-order mark.

We tried using plain UTF-16 and kept getting EOF errors. This was because we were missing the BOM at the start of the file. From Wikipedia:

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

If these bytes are not present, the file won't work. So we have the file, but there is one more secret that needs to be addressed: when constructing your command line you must include -w to tell bcp what type of data it is. When using just English data you can use -c (character). So it will look something like this:

bcp dbo.blah in C:\Users\blah\Desktop\events\blah.txt -S tcp:databaseurl,someport -d thedatabase -U username -P password -w

When this is all done you get some sweet looking data!

Good little endian!

如果没结果 2024-11-05 01:32:19


Just to share: I had a similar problem. I had Portuguese accents in a file and bcp imported garbage characters (e.g. À became ┴).
I tried -C with almost all code pages, without success. After hours I found a hint on the bcp MS help page:

Format file code pages take priority over the -C attribute

This means that in the format file I had to use "" (as in the LastName column below) instead of a collation name; once I changed that code page, the -C 65001 attribute imported the UTF-8 file without any problem:

13.0
4
1       SQLCHAR             0       7       ","      1     PersonID               ""
2       SQLCHAR             0       25      ","      2     FirstName              SQL_Latin1_General_CP1_CI_AS
3       SQLCHAR             0       30      ","      3     LastName               ""
4       SQLCHAR             0       11      "\r\n"   4     BirthDate              ""
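
The matching bcp invocation would be something along these lines (server, database, credentials and file names are placeholders):

bcp thedatabase.dbo.Person in D:\data\person.csv -S servername -U username -P password -C 65001 -f D:\data\person.fmt
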
长伴 2024-11-05 01:32:19

I managed to do this using SSIS and an ADO.NET destination instead of OLE DB.

迷迭香的记忆 2024-11-05 01:32:19


My exported data is in TSV format, from a DB that has Latin-1 encoding.

This is easy to check:
SELECT DATABASEPROPERTYEX('DB', 'Collation') SQLCollation;

The extract file is in UTF-8 format.

BULK INSERT wasn't working with UTF-8, so I converted UTF-8 to ISO-8859-1 (aka Latin-1) with a simple Clojure script:

(spit ".\\dump\\file1.txt"
      (slurp ".\\dump\\file1_utf8.txt" :encoding "UTF-8")
      :encoding "ISO-8859-1")

To execute it, correct the paths and run:
java.exe -cp clojure-1.6.0.jar clojure.main utf8_to_Latin1.clj
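
After the conversion, the Latin-1 file can be loaded with a matching code page. A sketch assuming a temp table and an absolute path (CODEPAGE = 'ACP', i.e. windows-1252, covers the printable Latin-1 range, and '\t' matches the TSV delimiter):

BULK INSERT #myTempTable
FROM 'C:\dump\file1.txt'        -- assumed path to the converted file
WITH
    (
        CODEPAGE = 'ACP',       -- ANSI code page (windows-1252)
        FIELDTERMINATOR = '\t', -- TSV field delimiter
        ROWTERMINATOR = '\n'
    );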

我很坚强 2024-11-05 01:32:19


I have tested bulk insertion with UTF-8 format files. It works fine in SQL Server 2012.

string bulkInsertQuery = @"DECLARE @BulkInsertQuery NVARCHAR(max) = 'bulk insert [dbo].[temp_Lz_Post_Obj_Lvl_0]
                                      FROM ''C:\\Users\\suryan\\Desktop\\SIFT JOB\\New folder\\POSTdata_OBJ5.dat''
                                      WITH ( FIELDTERMINATOR =  '''+ CHAR(28) + ''', ROWTERMINATOR = ''' +CHAR(10) + ''')'
                                      EXEC SP_EXECUTESQL @BulkInsertQuery";

I was using a *.DAT file with FS (the file separator character, CHAR(28) above) as the column separator.
