SQL Server JDBC: Latin1 to UTF-8

Posted 2025-02-09 23:54:44


There is a table in SQL Server with collation SQL_Latin1_General_CP1_CS_AS.
The table has a column varchar(35) with the same collation SQL_Latin1_General_CP1_CS_AS.

The column contains a string with the character 8f (hexadecimal).

See https://www.fileformat.info/info/unicode/char/008f/index.htm
According to this page, this character converted into UTF8 should become c28f.

When I read the value from this column in Java and convert it to UTF-8, the 8f is replaced with efbfbd. So the 8f gets lost, in a way.
See https://www.fileformat.info/info/unicode/char/0fffd/index.htm

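     import java.math.BigInteger;
     import java.nio.charset.StandardCharsets;

     // Returns the UTF-8 bytes of str as an uppercase hex string.
     public static String convertStrToHex(String str) {
         byte[] utf8Bytes = str.getBytes(StandardCharsets.UTF_8);
         // Signum 1 makes BigInteger treat the bytes as an unsigned magnitude.
         BigInteger bigInteger = new BigInteger(1, utf8Bytes);
         return String.format("%X", bigInteger);
     }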

When I query the table

select BadCol from MyTbl
System.out.println(convertStrToHex(resultSet.getString(1)));

I get EFBFBD and not C28F.

When I declare a string variable "\u008f" and convert it in UTF-8:

String code="\u008f";
System.out.println(convertStrToHex(code));

I correctly get C28F.

So why does a variable get converted correctly, while the same value read over JDBC through a ResultSet does not?

Tested with SQL Server 2017 and 2019, and with both the mssql and jTDS JDBC drivers, with the same result.

I would appreciate any help!
As I understand it, the JDBC driver is to blame. But why?


千紇 2025-02-16 23:54:44

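Latin-1 has no character with the hexadecimal code 8F. It is an invalid character.

So when it is converted to UTF-8, it is replaced with the replacement character.

The replacement character has the Unicode code point U+FFFD. Encoded in UTF-8, it becomes EF BF BD.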
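The effect described above can be reproduced without a database. The following is only a sketch: the class and method names are made up for illustration, and it assumes the driver decodes the column with the Windows-1252 code page that the SQL_Latin1_General_CP1 collations use. As far as I know, Java's windows-1252 decoder treats byte 8F as unmapped and substitutes U+FFFD, whereas decoding the same byte as ISO-8859-1 yields U+008F.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // Hex dump of a string's UTF-8 bytes, two uppercase digits per byte.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Decodes a single byte value under the named charset.
    static String decodeByte(int value, String charsetName) {
        return new String(new byte[] {(byte) value}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The C1 control character from the question.
        System.out.println(utf8Hex("\u008f"));                         // C28F
        // The Unicode replacement character.
        System.out.println(utf8Hex("\uFFFD"));                         // EFBFBD
        // Byte 8F decoded as ISO-8859-1 maps straight to code point 8F.
        System.out.println(utf8Hex(decodeByte(0x8F, "ISO-8859-1")));   // C28F
        // Byte 8F decoded as windows-1252: compare this line with the ones above.
        System.out.println(utf8Hex(decodeByte(0x8F, "windows-1252")));
    }
}
```

If the last line prints EFBFBD, the replacement already happened while decoding the bytes into a Java String, before any UTF-8 conversion took place.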


夏末 2025-02-16 23:54:44


You are correct that 8f is not a valid UTF-8 byte on its own. 8f is also not a valid Latin1 character.

8f is a valid character in some Windows charsets, which are supersets of ISO 8859-n charsets. Your varchar value is probably a Windows-1250, Windows-1251, Windows-1256, or Windows-1257 value. You will have to make an assumption based on the language of your users or, less ideally, the default language of your software.

If possible, set your JDBC connection to use one of those charsets. (Exactly how that is done will depend on which database you are using. For instance, I believe MySQL allows characterEncoding=windows-1250 as a query parameter in a JDBC URL.)

If you can’t do that, do the conversion yourself when reading the value from the database. Replace this:

resultSet.getString(1)

with one of these:

new String(resultSet.getBytes(1), "windows-1250")
new String(resultSet.getBytes(1), "windows-1251")
new String(resultSet.getBytes(1), "windows-1256")
new String(resultSet.getBytes(1), "windows-1257")

Windows-1250 is for central and eastern Europe. Wikipedia says it can be used for Polish, Czech, Slovak, Hungarian, Slovene, Serbo-Croatian, Romanian, Albanian, and German text.

Windows-1251 is for Cyrillic languages. Wikipedia says it can be used for Russian, Ukrainian, and Belarusian, among others.

Windows-1256 is for Arabic languages.

Windows-1257 is for Estonian, Latvian, and Lithuanian.
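To help pick among the four candidates, here is a rough sketch that decodes the single byte 8F under each charset and prints the resulting code point. The class and helper names are hypothetical, and the hard-coded byte array only stands in for what resultSet.getBytes(1) might return; whether a given driver hands back the undecoded column bytes for a varchar column is an assumption you would need to verify.

```java
import java.nio.charset.Charset;

class WindowsCodePageDemo {
    // The four Windows code pages suggested in the answer.
    static final String[] CANDIDATES = {
        "windows-1250", "windows-1251", "windows-1256", "windows-1257"
    };

    // Decodes raw column bytes under the named charset.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    // Formats every code point of s in U+XXXX notation, space separated.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Stand-in for the bytes read from the column (assumption, see above).
        byte[] raw = {(byte) 0x8F};
        for (String name : CANDIDATES) {
            if (!Charset.isSupported(name)) {
                // Some runtimes ship without the extended charsets.
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            System.out.println(name + ": " + codePoints(decode(raw, name)));
        }
    }
}
```

Comparing the decoded characters against what the stored text is supposed to say is usually the quickest way to tell which code page the data was written in.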
