Convert Unicode (CJK Ext-B) characters to decimal NCRs in Java/Scala

Posted 2024-10-21 01:09:26

I'm trying to convert a Java string that contains Unicode characters in the CJK Ext-B plane to decimal NCRs.

For example (you can try it with http://people.w3.org/rishida/tools/conversion/ ):

"游鍚堃" should convert to &#28216;&#37722;&#22531;
"𧦧懷" should convert to &#162215;&#25079;

Here is what I tried (in Scala):

def charToHex(char: Char) = "&#%d;" format(char.toInt)
def stringToHex (string: String) = string.flatMap(charToHex)

println (stringToHex("游鍚堃")) // &#28216;&#37722;&#22531;
println (stringToHex("𧦧懷"))   // &#55390;&#56743;&#25079;
println ("𧦧懷".toCharArray().length) // Why is it 3?

As you can see, the first case converts correctly: three Unicode characters become three NCRs.

But in the second case, "𧦧懷", there are only two Unicode characters, yet Java/Scala seems to think the string contains three characters.

So what is happening here, and how can I convert the second case correctly, just like the converter on the site I mentioned? Thanks a lot.

Update:

My source code file is using UTF-8.
Here is the result of "𧦧懷".toCharArray():
- char[] = ?, char.toInt = 55390
- char[] = ?, char.toInt = 56743
- char[] = 懷, char.toInt = 25079

Now I think I know what happened. The character "𧦧" is encoded as 0xD85E 0xDDA7 in UTF-16, which takes 4 bytes instead of 2. So it occupies 2 elements when converted to an array of char, because the char data type can only hold 2 bytes.
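
To double-check that reading, here is a minimal sketch (not part of the original attempt) that recombines the two char values listed above by hand, using the standard UTF-16 surrogate formula, and compares the result with java.lang.Character.toCodePoint. The object name SurrogateMath and the helper combine are made up for illustration.

```scala
object SurrogateMath {
  // Combines a high/low surrogate pair by hand, following the UTF-16 formula:
  // codePoint = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)
  def combine(high: Char, low: Char): Int =
    0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

  def main(args: Array[String]): Unit = {
    val high = 0xD85E.toChar // 55390, the first element of the char array
    val low  = 0xDDA7.toChar // 56743, the second element of the char array
    val cp   = combine(high, low)

    println(cp)                                     // 162215
    println(cp == Character.toCodePoint(high, low)) // true
    println(Character.isSupplementaryCodePoint(cp)) // true
  }
}
```

162215 is U+279A7, which falls in the CJK Ext-B range (U+20000 to U+2A6DF), so the two surrogate chars really do describe a single character.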

一口甜 2024-10-28 01:09:26

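Java (and thus Scala) uses UTF-16 encoding for its strings, which means that every Unicode code point above 2^16-1 must be represented by two Chars. (The encoding scheme is actually a bit more complicated than that.) Anyway, length is a method that operates at the lower level (Chars), so it returns the number of Chars.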

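If you want to find out the number of code points, which is probably what you intuitively mean when you say "two unicode characters" (that is, two symbols that print out), you need to use s.codePointCount(0, s.length). If you want to convert them to hex, you need to work with code points instead of Chars, because not every code point fits in a Char. My answer to this question contains Scala code that converts strings to code points. (It is not very efficient; if you are doing heavy text processing on large strings, you would want to rewrite it to use an Array/ArrayBuffer.)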
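
A minimal sketch of that idea, assuming the goal is the decimal NCR output from the question. This is not the code from the linked answer; the object name DecimalNcr and the helper toDecimalNcr are made up for illustration. It walks the string with codePointAt and Character.charCount so that a surrogate pair is emitted as one NCR rather than two.

```scala
object DecimalNcr {
  // Walks the string by code point (not by Char) and formats each one as &#N;.
  def toDecimalNcr(s: String): String = {
    val sb = new StringBuilder
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)     // reads a surrogate pair as one code point
      sb.append("&#").append(cp).append(';')
      i += Character.charCount(cp)  // 2 Chars for supplementary code points, else 1
    }
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    // U+279A7 (written as its surrogate pair) followed by U+61F7.
    // Escapes are used so the example does not depend on the source file encoding.
    val s = "\uD85E\uDDA7\u61F7"

    println(s.length)                      // 3
    println(s.codePointCount(0, s.length)) // 2
    println(toDecimalNcr(s))               // &#162215;&#25079;
  }
}
```

A while loop over indices keeps this close to the underlying Java API; for short strings a more functional style would work just as well.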

唔猫 2024-10-28 01:09:26

These are what Unicode calls "surrogates". For instance,

"𧦧懷" foreach { c =>
  println(java.lang.Character.UnicodeBlock.of(c))
}

prints

HIGH_SURROGATES
LOW_SURROGATES
CJK_UNIFIED_IDEOGRAPHS
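
For comparison, here is a small sketch of the same lookup done per code point instead of per Char, using the UnicodeBlock.of overload that takes an Int. The object name BlockByCodePoint and the helper blocks are made up for illustration.

```scala
object BlockByCodePoint {
  // Returns the Unicode block name of every code point in the string.
  def blocks(s: String): List[String] = {
    val out = List.newBuilder[String]
    var i = 0
    while (i < s.length) {
      val cp = s.codePointAt(i)                            // joins surrogate pairs
      out += String.valueOf(Character.UnicodeBlock.of(cp)) // of(Int) overload
      i += Character.charCount(cp)
    }
    out.result()
  }

  def main(args: Array[String]): Unit = {
    // U+279A7 (as a surrogate pair) followed by U+61F7
    blocks("\uD85E\uDDA7\u61F7").foreach(println)
    // CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
    // CJK_UNIFIED_IDEOGRAPHS
  }
}
```

Seen this way, the string holds two characters, and the first one belongs to the Ext-B block rather than to the surrogate ranges.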

BTW, I am based in Taiwan as well. If you are interested in Scala, we should get together and talk shop. My email is in my profile if you are interested.
