Convert Unicode (CJK Ext B) characters to decimal NCRs in Java/Scala
I'm trying to convert a Java string that contains Unicode characters in the CJK Ext B plane to decimal NCRs.
For example (you could try it with http://people.w3.org/rishida/tools/conversion/ ):
- "游鍚堃" should convert to
&#28216;&#37722;&#22531;
- "𧦧懷" should convert to
&#162215;&#25079;
Here is what I tried (in Scala):
def charToHex(char: Char) = "&#%d;" format (char.toInt)
def stringToHex(string: String) = string.flatMap(charToHex)
println(stringToHex("游鍚堃")) // &#28216;&#37722;&#22531;
println(stringToHex("𧦧懷")) // &#55390;&#56743;&#25079;
println("𧦧懷".toCharArray().length) // Why is it 3?
As you can see, it converts correctly in the first case: three Unicode characters become three NCRs.
But in the second case, "𧦧懷", there are only two Unicode characters, yet Java/Scala seems to think the string contains three characters.
So, what is happening here and how could I convert the second case correctly just like the converter on the site I mentioned? Thanks a lot.
Update:
- My source code file is using UTF-8.
- Here is the result of "𧦧懷".toCharArray()
char[] = ?, char.toInt = 55390
char[] = ?, char.toInt = 56743
char[] = 懷, char.toInt = 25079
Now I think I know what happened. The character "𧦧" is encoded as 0xD85E 0xDDA7 in UTF-16, which is 4 bytes instead of 2 bytes. So it takes 2 elements when converted to an array of char, because the char data type can only hold 2 bytes.
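The arithmetic behind that observation can be sketched as follows. This is only an illustration: the class name SurrogateSketch and the helper combine are made up for this example, while the two input values are the ones printed in the update above. The manual formula is the standard UTF-16 decoding rule, and Character.toCodePoint is the JDK method that performs the same calculation.

```java
final class SurrogateSketch {
    // Hypothetical helper: combine a high/low surrogate pair into one code point by hand.
    static int combine(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = (char) 55390; // 0xD85E, first value from the update
        char low = (char) 56743;  // 0xDDA7, second value from the update

        System.out.println(Character.isSurrogatePair(high, low)); // true
        System.out.println(combine(high, low));                   // 162215
        System.out.println(Character.toCodePoint(high, low));     // 162215
    }
}
```

162215 is U+279A7 in decimal, which matches the NCR the online converter produces for the first character.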
Comments (3)
Java (and therefore Scala) uses UTF-16 encoding for its strings, which means that every Unicode code point above 2^16-1 (U+FFFF) must be represented with two chars. (Actually, the encoding scheme is a bit more complex than that.) Anyway, length is a method that operates at a lower level, on chars, so it returns the number of chars.
If you want to find out the number of code points, which is what you are probably thinking of intuitively when you say "two unicode characters" (e.g. two symbols that print out), you need to use s.codePointCount(0, s.length). And if you want to convert those to NCRs, you need to be working with code points, not Chars, since not all code points fit in a Char. My answer to this question contains Scala code to convert a string to code points. (Not with maximal efficiency; you'd want to rewrite it to use arrays/ArrayBuffer if you're doing heavy-duty text processing on large strings.)
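As a rough illustration of working with code points instead of chars, here is a minimal Java sketch. It is not the Scala code referred to above; the class name NcrSketch and the method toDecimalNcr are made up for this example, and it assumes Java 8 or later for String.codePoints(). The test string is written with Unicode escapes so that it does not depend on the source file encoding.

```java
import java.util.stream.Collectors;

final class NcrSketch {
    // Hypothetical helper: emit one decimal NCR per code point rather than per char.
    static String toDecimalNcr(String s) {
        return s.codePoints()
                .mapToObj(cp -> "&#" + cp + ";")
                .collect(Collectors.joining());
    }

    public static void main(String[] args) {
        String s = "\uD85E\uDDA7\u61F7"; // U+279A7 followed by U+61F7

        System.out.println(s.length());                      // 3 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 2 (code points)
        System.out.println(toDecimalNcr(s));                 // &#162215;&#25079;
    }
}
```

From Scala the same Java methods are available on String, so the approach carries over without an extra library.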
This is what is called a "surrogate" in Unicode speak. For instance,
prints
BTW, I am based in Taiwan as well. If you are interested in Scala, we should get together and talk shop. My email is in my profile if you are interested.
Check the file encoding. Your IDE or your build script must know whether the file is UTF-8 or UTF-16 (which one do you use?). If the file has a BOM, check that it matches that encoding.
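One way to rule the source encoding in or out is to build the string without relying on it at all. The sketch below is only an illustration: the class name EncodingCheckSketch and the method decodeUtf8 are made up, and the byte values are the UTF-8 encoding of U+279A7 and U+61F7 worked out by hand. If a literal typed directly into the source file behaves differently from these strings, the compiler is probably reading the file with the wrong encoding (javac and scalac both accept an -encoding option).

```java
import java.nio.charset.StandardCharsets;

final class EncodingCheckSketch {
    // Hypothetical helper: decode bytes as UTF-8 regardless of the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes are pure ASCII, so they survive any source file encoding.
        String escaped = "\uD85E\uDDA7\u61F7";

        // UTF-8 bytes: F0 A7 A6 A7 for U+279A7, E6 87 B7 for U+61F7.
        byte[] utf8 = {
            (byte) 0xF0, (byte) 0xA7, (byte) 0xA6, (byte) 0xA7,
            (byte) 0xE6, (byte) 0x87, (byte) 0xB7
        };
        String decoded = decodeUtf8(utf8);

        System.out.println(decoded.equals(escaped)); // true
        System.out.println(decoded.length());        // 3
    }
}
```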