Converting Unicode (CJK Ext B) characters to decimal NCRs in Java/Scala

Posted on 2024-10-21 01:09:26

I'm trying to convert a Java string that contains Unicode characters from the CJK Ext B block to decimal NCRs.

For example (you could try it with http://people.w3.org/rishida/tools/conversion/ ):

  • "游鍚堃" should convert to &#28216;&#37722;&#22531;
  • "𧦧懷" should convert to &#162215;&#25079;

Here is what I tried (in Scala):

def charToHex(char: Char) = "&#%d;" format(char.toInt)
def stringToHex (string: String) = string.flatMap(charToHex)

println (stringToHex("游鍚堃")) // &#28216;&#37722;&#22531;
println (stringToHex("𧦧懷"))   // &#55390;&#56743;&#25079; (renders as two broken characters followed by 懷)
println ("𧦧懷".toCharArray().length) // Why is it 3?

As you can see, it converts correctly in the first case: three Unicode characters become three NCRs.

But in the second case, "𧦧懷", there are only two Unicode characters, yet Java/Scala seems to think the string contains three characters.

So what is happening here, and how can I convert the second case correctly, just like the converter on the site I mentioned? Thanks a lot.

Update:

  • My source code file is using UTF-8.
  • Here is the result of "𧦧懷".toCharArray()
    • char[] = ?, char.toInt = 55390
    • char[] = ?, char.toInt = 56743
    • char[] = 懷, char.toInt = 25079

Now I think I know what happened. The character "𧦧" is encoded as 0xD85E 0xDDA7 in UTF-16, which takes 4 bytes instead of 2. So it occupies 2 elements when the string is converted to an array of char, because a char can hold only 2 bytes.
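To make the arithmetic concrete, here is a minimal sketch of how the two char values from the update combine into a single code point. The object name SurrogateMath and the helper combine are made up for illustration; Character.toCodePoint is the JDK method that does the same job.

```scala
object SurrogateMath {
  // Hypothetical helper: combine a UTF-16 high/low surrogate pair by hand.
  // Each surrogate carries 10 bits; the result is offset by 0x10000.
  def combine(high: Char, low: Char): Int =
    0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

  def main(args: Array[String]): Unit = {
    val high = 55390.toChar // 0xD85E, first element of the char array
    val low  = 56743.toChar // 0xDDA7, second element of the char array
    println(combine(high, low))               // 162215 (U+279A7)
    println(Character.toCodePoint(high, low)) // 162215, same result from the JDK
  }
}
```

162215 is the number the W3C converter puts in the NCR, so the two "extra" chars are really one character split in half.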

Comments (3)

一口甜 2024-10-28 01:09:26

Java (and therefore Scala) uses UTF-16 for its strings, which means that every Unicode code point above 2^16-1 must be represented with two chars. (Actually, the encoding scheme is a bit more complex than that.) Anyway, length is a method that operates at that lower level, chars, so it returns the number of chars, not the number of code points.

If you want the number of code points, which is probably what you have in mind when you say "two unicode characters" (e.g. two symbols that print out), you need to use s.codePointCount(0, s.length). And if you want to convert those to hex, you need to work with code points, not Chars, since not all code points fit in a Char. My answer to this question contains Scala code to convert a string to code points. (It is not maximally efficient; you'd want to rewrite it to use arrays/ArrayBuffer if you're doing heavy-duty text processing on large strings.)
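A minimal sketch of that approach follows. It is not the code from the linked answer; the names DecimalNcr, codePoints and toDecimalNcr are invented for this example. It walks the string with String.codePointAt and Character.charCount, both standard JDK methods, and writes the test string with \u escapes so the result does not depend on the source file encoding.

```scala
object DecimalNcr {
  // Hypothetical helper: list the code points of a string, stepping over
  // surrogate pairs as a single unit instead of one Char at a time.
  def codePoints(text: String): List[Int] = {
    val out = List.newBuilder[Int]
    var i = 0
    while (i < text.length) {
      val cp = text.codePointAt(i)
      out += cp
      i += Character.charCount(cp) // 2 for a surrogate pair, otherwise 1
    }
    out.result()
  }

  // Hypothetical helper: one decimal NCR per code point.
  def toDecimalNcr(text: String): String =
    codePoints(text).map(cp => s"&#$cp;").mkString

  def main(args: Array[String]): Unit = {
    val text = "\uD85E\uDDA7\u61F7" // the second example from the question
    println(text.length)                         // 3 (chars)
    println(text.codePointCount(0, text.length)) // 2 (code points)
    println(toDecimalNcr(text))                  // &#162215;&#25079;
  }
}
```

Like the code in the question, this turns every character into an NCR, including plain ASCII; filtering on cp > 127 before formatting would be one way to leave ASCII untouched.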

唔猫 2024-10-28 01:09:26

This is what is called a "surrogate" in Unicode speak. For instance,

"𧦧懷" foreach { c =>
  println(java.lang.Character.UnicodeBlock.of(c))
}

prints

HIGH_SURROGATES
LOW_SURROGATES
CJK_UNIFIED_IDEOGRAPHS
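For comparison, here is a sketch of the same lookup done per code point rather than per Char. It relies on the Int overload of Character.UnicodeBlock.of; the object and method names are made up for this example, and the string is written with \u escapes to sidestep source-encoding issues.

```scala
object BlockByCodePoint {
  // Hypothetical helper: name of the Unicode block for each code point.
  def blockNames(text: String): List[String] = {
    val out = List.newBuilder[String]
    var i = 0
    while (i < text.length) {
      val cp = text.codePointAt(i)
      // of(Int) returns null for code points outside any known block.
      out += Option(Character.UnicodeBlock.of(cp)).map(_.toString).getOrElse("(no block)")
      i += Character.charCount(cp)
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    blockNames("\uD85E\uDDA7\u61F7").foreach(println)
    // CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
    // CJK_UNIFIED_IDEOGRAPHS
  }
}
```

Seen this way the string has two entries, and the first one lands in CJK Ext B instead of the two surrogate blocks.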

BTW, I am based in Taiwan as well. If you are interested in Scala, we should get together and talk shop. My email is in my profile if you are interested.

晨敛清荷 2024-10-28 01:09:26

Check the file encoding. Your IDE or your build script must know whether the file is UTF-8 or UTF-16 (which one do you use?). If the file has a BOM, check that it matches the encoding.
