从XML解析,插入到mysql;字符给出 java.sql.SQLException:错误的字符串值

发布于 2024-11-24 01:10:25 字数 352 浏览 1 评论 0原文

我正在解析一堆 XML 文件并将从它们获得的值插入到 MySQL 数据库中。 mysql表的字符集设置为utf8。我使用以下连接 url 连接到数据库 - jdbc:mysql://localhost:3306/articles_data?useUnicode=false&characterEncoding=utf8

大多数带有 unicode 字符的字符串值都可以正常输入(如希腊字母等),除了一些带有数学符号的。特别是一个例子 - 当我尝试插入带有数学脚本大写 g 的字符串时(img at www.ncbi.nlm.nih.gov/corehtml/pmc/pmcents/1D4A2.gif)(http://graphemica.com/

I am parsing a bunch of XML files and inserting the value obtained from them into a MySQL database. The character set of the mysql tables is set to utf8. I'm connecting to the database using the following connection url - jdbc:mysql://localhost:3306/articles_data?useUnicode=false&characterEncoding=utf8

Most of the string values with unicode characters are entered fine (like Greek letters etc.), except for some that have a math symbol. An example in particular - when I try to insert a string with mathematical script capital g (img at www.ncbi.nlm.nih.gov/corehtml/pmc/pmcents/1D4A2.gif) ( http://graphemica.com/???? ) (Trying to parse and insert this article), I get the following exception -

java.sql.SQLException: Incorrect string value: '\xF0\x9D\x92\xA2 i...' for column 'text' at row 1
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:1055)
at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:956)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3515)
at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3447)
at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:1951)
at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2101)
at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2554)
at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1761)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:2046)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1964)
at com.mysql.jdbc.PreparedStatement.executeUpdate(PreparedStatement.java:1949)

If I change my connection URL to - jdbc:mysql://localhost:3306/articles_data, then the insert works, but all regular UTF8 characters are replaced with a question mark.

There are two possible ways I'm trying to fix it, and haven't succeeded at either yet -

  1. When parsing the article, maintain the encoding. I'm using org.apache.xerces.parsers.DOMParser to parse the xml files, but can't figure out how to prevent it from decoding (relevant XML - <p>𝒢 is a set containing...</p>). I could re-encode it, but that just seems inefficient.

  2. Insert the math symbols into the database.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

撩人痒 2024-12-01 01:10:25

MySQL 5.1 版之前似乎只支持基本多语言平面中的 unicode 字符,编码为 utf-8 时不超过 3 个字节。来自 版本 5.1 中的 unicode 支持手册

MySQL 5.1支持两种字符集来存储Unicode数据:

  • ucs2,Unicode 字符集的 UCS-2 编码,每个字符使用 16 位
  • utf8,Unicode 字符集的 UTF-8 编码,每个字符使用一到三个字节

utf8,每个字符使用 1 到 3 个字节的 href="http://dev.mysql.com/doc/refman/5.5/en/charset-unicode.html" rel="nofollow">版本 5.5 一些新的字符集,其中补充道:

...

  • utf8mb4,Unicode 字符集的 UTF-8 编码,每个字符使用一到四个字节

ucs2和utf8支持BMP字符。 utf8mb4、utf16 和 utf32 支持 BMP 和增补字符。

因此,如果您使用的是 mysql 5.1,您首先必须升级。在更高版本中,您必须将字符集更改为 utf8mb4 才能使用这些增补字符。

看来 jdbc 连接器还需要一些进一步的配置(来自 连接器/J 注释和提示):

要在 Connector/J 中使用 4 字节 UTF8,请使用 character_set_server=utf8mb4 配置 MySQL 服务器。只要连接字符串中尚未设置 characterEncoding,Connector/J 就会使用该设置。这相当于自动检测字符集。

MySQL up to version 5.1 seems to only support unicode characters in the basic multilingual plane, which when encoded as utf-8 take no more than 3 bytes. From the manual on unicode support in version 5.1:

MySQL 5.1 supports two character sets for storing Unicode data:

  • ucs2, the UCS-2 encoding of the Unicode character set using 16 bits per character
  • utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character

In version 5.5 some new character sets where added:

...

  • utf8mb4, a UTF-8 encoding of the Unicode character set using one to four bytes per character

ucs2 and utf8 support BMP characters. utf8mb4, utf16, and utf32 support BMP and supplementary characters.

So if you are on mysql 5.1 you would first have to upgrade. In later versions you have to change the charset to utf8mb4 to work with these supplementary characters.

It seems the jdbc connector also requires some further configuration (From Connector/J Notes and Tips):

To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4. Connector/J will then use that setting as long as characterEncoding has not been set in the connection string. This is equivalent to autodetection of the character set.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文