How to encode/decode/transcode a byte array directly with a charset in Java
I have a malformed string, possibly caused by a bug in the MySQL JDBC driver.
The bytes of a sample malformed string (malformed_string.getBytes("UTF-8")) are:
C3 A4 C2 B8 C2 AD C3 A6 E2 80 93 E2 80 A1 (UTF-8 twice)
These are the UTF-8 encoding of the following bytes (which are already UTF-8 encoded, but were treated as ISO-8859-1 encoded):
----- ----- ----- ----- -------- --------
 E4    B8    AD    E6      96       87    (UTF-8)
which in turn are the UTF-8 encoding of the following Unicode big-endian bytes:
--------------- ---------------------
     4E 2D              65 87         (Unicode BigEndian)
I want to decode the first sequence into the second. I tried new String(malformed_string.getBytes("UTF-8"), "ISO-8859-1"), but it does not transcode as expected. Is there something like byte[] encode/decode(byte[] src, String charsetName), or how else can I achieve this transcoding in Java?
Background:
I have a MySQL table with Chinese column names. When I update such a column with data that is too long, the MySQL JDBC driver throws an exception like this:
com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column 'ä¸æ–‡' at row 1
The column name in the exception is malformed. It should be "中文", and it must be displayed correctly to the user, as follows:
com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column '中文' at row 1
EDIT
Here are MySQL statements that demonstrate how the malformed string occurred and how to restore it to the correct string:
show variables like 'char%';
+--------------------------+--------------------------+
| Variable_name | Value |
+--------------------------+--------------------------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
| character_sets_dir | C:\mysql\share\charsets\ |
+--------------------------+--------------------------+
-- encode
select
hex(convert(convert(unhex('E4B8ADE69687') using UTF8) using ucs2)) as `hex(src in UNICODE)`,
unhex('E4B8ADE69687') `src in UTF8`,
'E4B8ADE69687' `hex(src in UTF8)`,
hex(convert(convert(unhex('E4B8ADE69687') using latin1) using UTF8)) as `hex(src in UTF8->Latin1->UTF8)`;
+---------------------+-------------+------------------+--------------------------------+
| hex(src in UNICODE) | src in UTF8 | hex(src in UTF8) | hex(src in UTF8->Latin1->UTF8) |
+---------------------+-------------+------------------+--------------------------------+
| 4E2D6587 | 中文 | E4B8ADE69687 | C3A4C2B8C2ADC3A6E28093E280A1 |
+---------------------+-------------+------------------+--------------------------------+
1 row in set (0.00 sec)
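For readers following along in Java, the encode query above can be mirrored with plain JDK calls. This is a minimal sketch, not code from the question: the class name DoubleEncodingDemo and its helpers hex and corrupt are made up for illustration, and it assumes the JVM provides the windows-1252 charset. That charset is used rather than ISO-8859-1 because MySQL's latin1 is cp1252, which is what turns the bytes 96 and 87 into E2 80 93 and E2 80 A1 in the output above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DoubleEncodingDemo {
    // Assumption: MySQL's latin1 behaves like cp1252 for these bytes.
    static final Charset CP1252 = Charset.forName("windows-1252");

    /** Formats bytes as upper-case hex, like MySQL's HEX(). */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Misreads the UTF-8 bytes of s as cp1252, producing the mojibake. */
    static String corrupt(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), CP1252);
    }

    public static void main(String[] args) {
        String src = "\u4E2D\u6587"; // the two characters of the column name
        // hex(src in UNICODE)
        System.out.println(hex(src.getBytes(StandardCharsets.UTF_16BE)));
        // hex(src in UTF8)
        System.out.println(hex(src.getBytes(StandardCharsets.UTF_8)));
        // hex(src in UTF8->Latin1->UTF8)
        System.out.println(hex(corrupt(src).getBytes(StandardCharsets.UTF_8)));
    }
}
```

The three printed lines should match the three hex columns of the query result: 4E2D6587, E4B8ADE69687 and C3A4C2B8C2ADC3A6E28093E280A1.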
-- decode
select
unhex('C3A4C2B8C2ADC3A6E28093E280A1') as `malformed`,
'C3A4C2B8C2ADC3A6E28093E280A1' as `hex(malformed)`,
hex(convert(convert(unhex('C3A4C2B8C2ADC3A6E28093E280A1') using utf8) using latin1)) as `hex(malformed->UTF8->Latin1)`,
convert(convert(convert(convert(unhex('C3A4C2B8C2ADC3A6E28093E280A1') using utf8) using latin1) using binary) using utf8) `malformed->UTF8->Latin1->binary->UTF8`;
+----------------+------------------------------+------------------------------+---------------------------------------+
| malformed | hex(malformed) | hex(malformed->UTF8->Latin1) | malformed->UTF8->Latin1->binary->UTF8 |
+----------------+------------------------------+------------------------------+---------------------------------------+
| ä¸æ–‡ | C3A4C2B8C2ADC3A6E28093E280A1 | E4B8ADE69687 | 中文 |
+----------------+------------------------------+------------------------------+---------------------------------------+
1 row in set (0.00 sec)
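The decode query can be sketched in Java along the same lines. Assuming the malformed text arrives as a java.lang.String (for example, inside the exception message), the idea is to re-encode it as windows-1252 to get back the original UTF-8 bytes and then decode those bytes as UTF-8. The attempt in the question, new String(s.getBytes("UTF-8"), "ISO-8859-1"), runs in the opposite direction and adds another layer of mojibake instead of removing one. The class and method names below (MojibakeRepair, repair) are hypothetical, and the sketch assumes the JVM provides windows-1252.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Assumption: the text was misread as cp1252 (MySQL's latin1).
    static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Undoes one layer of "UTF-8 read as cp1252" corruption.
     * Plain ASCII passes through unchanged, so a whole message can be repaired.
     */
    static String repair(String malformed) {
        byte[] originalUtf8 = malformed.getBytes(CP1252); // back to E4 B8 AD E6 96 87
        return new String(originalUtf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The malformed column name, written with escapes because it
        // contains an invisible soft hyphen (U+00AD).
        String malformed = "\u00E4\u00B8\u00AD\u00E6\u2013\u2021";
        String message = "Data too long for column '" + malformed + "' at row 1";
        System.out.println(repair(message));
    }
}
```

One caveat, offered as a hedge rather than something the question confirms: cp1252 leaves the bytes 81, 8D, 8F, 90 and 9D undefined. MySQL's latin1 still maps them, while Java's windows-1252 is not expected to, so text whose UTF-8 form contains those bytes may not survive this round trip. Fixing the connection's character set at the source is likely the more reliable long-term remedy.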
Comments (1)
Check out this tutorial: http://download.oracle.com/javase/tutorial/i18n/text/string.html
The punch line is that the way to transcode using only the JDK classes is:
For a more robust transcoding mechanism, I suggest you check out ICU.
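As a rough illustration of a JDK-only byte[] to byte[] helper of the kind the question asks for, the sketch below decodes the source bytes with one charset and re-encodes the resulting characters with another. It is not the snippet from the answer above: Transcoder and transcode are hypothetical names, and the choice to reject malformed or unmappable input (instead of silently substituting '?') is an assumption about what is wanted here.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Transcoder {
    /**
     * Decodes src using 'from', then encodes the characters using 'to'.
     * Throws IllegalArgumentException if src is not valid in 'from' or
     * contains characters that 'to' cannot represent.
     */
    static byte[] transcode(byte[] src, Charset from, Charset to) {
        try {
            CharBuffer chars = from.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(src));
            ByteBuffer out = to.newEncoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .encode(chars);
            byte[] result = new byte[out.remaining()];
            out.get(result);
            return result;
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException(
                    "Cannot transcode from " + from + " to " + to, e);
        }
    }

    public static void main(String[] args) {
        // The "UTF-8 twice" bytes from the question, built from the mojibake text.
        byte[] twice = "\u00E4\u00B8\u00AD\u00E6\u2013\u2021"
                .getBytes(StandardCharsets.UTF_8);
        // Going from UTF-8 to windows-1252 strips one encoding layer.
        byte[] once = transcode(twice, StandardCharsets.UTF_8,
                Charset.forName("windows-1252"));
        System.out.println(
                new String(once, StandardCharsets.UTF_8).equals("\u4E2D\u6587")); // prints "true"
    }
}
```

Reporting errors instead of replacing characters makes a wrong charset guess fail loudly, which seems preferable when the output is shown to users.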