Java string: converting a Unicode code point to a character
OK, so I feel like this question has been asked many times, but I am not able to find an answer. I am comparing two files that were generated by two different programs. Both programs generate their files from the same DB queries. I am running into differences like the following:
s1 =
Samsung - Mobile USB Chargers
vs.
s2 =
Samsung \u2013 Mobile USB Chargers
How do I convert s2 to s1 or, even better, how do I compare the two without getting a difference? Someone somewhere on the wide wide internets mentioned using the StringUtils class from Apache Commons Lang, but I couldn't find anything useful there.
Comments (2)
The program that generated the first string is writing the file in ASCII, using a character substitution fallback mechanism. The second is writing the file in Unicode.
These could be compared by making a copy of the second file in ASCII using the same fallback mechanism.
The best solution would be to modify the first program so that it also uses Unicode.
(It is possible that the second file was written in something other than Unicode, since some other character sets, windows-1252 for example, also include the en dash. If so, the best solution is still to write both files in Unicode, if possible.)
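The ASCII-copy comparison suggested above could be sketched roughly as follows. The first program's actual fallback rules are not known, so the substitution table in `fallback` is purely an assumption for illustration, as are the names `AsciiFallback`, `toAscii` and `fallback`:

```java
class AsciiFallback {
    // Hypothetical fallback table: the real rules used by the first
    // program are unknown, so these mappings are only an assumption.
    static String fallback(int codePoint) {
        switch (codePoint) {
            case 0x2013: // EN DASH
            case 0x2014: // EM DASH
                return "-";
            case 0x2018: // LEFT SINGLE QUOTATION MARK
            case 0x2019: // RIGHT SINGLE QUOTATION MARK
                return "'";
            case 0x201C: // LEFT DOUBLE QUOTATION MARK
            case 0x201D: // RIGHT DOUBLE QUOTATION MARK
                return "\"";
            default:
                return "?"; // no known substitute
        }
    }

    // Copies ASCII characters unchanged and substitutes everything else.
    static String toAscii(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (cp < 128) {
                out.append((char) cp);
            } else {
                out.append(fallback(cp));
            }
            i += Character.charCount(cp); // step over surrogate pairs as one unit
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s1 = "Samsung - Mobile USB Chargers";
        String s2 = "Samsung \u2013 Mobile USB Chargers";
        System.out.println(toAscii(s2).equals(s1));
    }
}
```

If the table does not match what the first program really does, the comparison will still report differences, which is one more reason to prefer writing both files in Unicode.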
You could fold all the characters with the Dash_Punctuation property.
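A minimal sketch of that folding, assuming the regex category `\p{Pd}` from `java.util.regex` is used to match the property. The class and method names (`DashFold`, `foldDashes`) are illustrative and not taken from the original answer:

```java
class DashFold {
    // Replaces every character in the Dash_Punctuation (Pd) category
    // with a plain ASCII hyphen-minus.
    static String foldDashes(String s) {
        return s.replaceAll("\\p{Pd}", "-");
    }

    public static void main(String[] args) {
        String s1 = "Samsung - Mobile USB Chargers";
        String s2 = "Samsung \u2013 Mobile USB Chargers";
        // Fold both sides so that any kind of dash compares equal.
        System.out.println(foldDashes(s1).equals(foldDashes(s2)));
    }
}
```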
This code will print
true
Note that this applies to every character with that property (such as 〰 U+3030 WAVY DASH). A comprehensive list of the characters with the Dash_Punctuation (Pd) property is in UnicodeData.txt. Java 6 supports Unicode 4. See chapter 6 of the Unicode Standard for a discussion of punctuation.
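To see which characters the folding would touch, one option is to ask the JVM for a code point's general category. This is only a sketch: `DashCheck` and `isDash` are made-up names, and the answer depends on the Unicode version the running JVM supports.

```java
class DashCheck {
    // True when the JVM classifies the code point as Dash_Punctuation (Pd).
    static boolean isDash(int codePoint) {
        return Character.getType(codePoint) == Character.DASH_PUNCTUATION;
    }

    public static void main(String[] args) {
        int[] samples = {
            0x002D, // HYPHEN-MINUS
            0x2013, // EN DASH
            0x3030, // WAVY DASH
            0x2212, // MINUS SIGN (a math symbol, not a dash)
            0x005F  // LOW LINE (connector punctuation)
        };
        for (int cp : samples) {
            System.out.printf("U+%04X %s%n", cp, isDash(cp) ? "Pd" : "not Pd");
        }
    }
}
```

A check like this makes it easier to decide whether folding the whole category is acceptable or whether only specific dashes should be replaced.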