Reading a character stream with supplementary Unicode characters in Java

Posted on 2024-12-09 13:24:21

I'm having trouble reading supplementary Unicode characters using Java. I have a file that potentially contains characters in the supplementary set (anything greater than \uFFFF). When I set up my InputStreamReader to read the file using UTF-8, I would expect the read() method to return a single character for each supplementary character, but instead it seems to split them at the 16-bit threshold.

I saw some other questions about basic Unicode character streams, but nothing seems to deal with the case of characters wider than 16 bits.

Here's some simplified sample code:

InputStreamReader input = new InputStreamReader(file, "UTF8");
int nextChar = input.read();
while(nextChar != -1) {
    ...
    nextChar = input.read();
}

Does anyone know what I need to do to correctly read in a UTF-8 encoded file that contains supplementary characters?

Comments (2)

剩余の解释 2024-12-16 13:24:21

Java works with UTF-16 internally. So, if your input stream contains astral (supplementary) characters, each one will appear as a surrogate pair, i.e., as two chars. The first char is the high surrogate, and the second char is the low surrogate.
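
To make the surrogate pair concrete, here is a minimal sketch that prints the UTF-16 code units Java uses for a given code point. The class name SurrogatePairDemo and the helper describe are hypothetical names chosen for illustration; only the java.lang.Character methods are standard API.

```java
public class SurrogatePairDemo {
    // Hypothetical helper: returns the UTF-16 code units of a code point
    // as space-separated four-digit hex values.
    static String describe(int codePoint) {
        char[] units = Character.toChars(codePoint); // one or two chars
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int codePoint = 0x1F600; // a supplementary code point, above U+FFFF
        char[] units = Character.toChars(codePoint);

        System.out.println(describe(codePoint));                    // D83D DE00
        System.out.println(Character.isHighSurrogate(units[0]));    // true
        System.out.println(Character.isLowSurrogate(units[1]));     // true
        // Recombining the two halves gives back the original code point.
        System.out.println(Character.toCodePoint(units[0], units[1]) == codePoint); // true
    }
}
```

A Reader over the same character would hand back those two chars from two consecutive read() calls.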

久光 2024-12-16 13:24:21

Though read() is defined to return int, and could theoretically return a supplementary character's code point "all at once", I believe the return type is int only so that a value of -1 can be returned.

The value you're getting from read() is basically a char by another name, and a Java char is limited to 16 bits.

Java can only represent supplementary characters as UTF-16 surrogate pairs. As far as Java is concerned, there is no such thing as a "single character" (at least in the char sense) once you get above 0xFFFF.
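
If what you want is one int per code point, one possible approach is to recombine the surrogate halves yourself as you read. The following is only a sketch under some assumptions: the class CodePointReader and its methods readCodePoint and decodeUtf8 are hypothetical names, not part of the JDK, and a PushbackReader is used so that a char read after an unpaired high surrogate is not lost.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class CodePointReader {
    // Hypothetical helper: reads one full code point, or returns -1 at end of stream.
    static int readCodePoint(PushbackReader reader) throws IOException {
        int first = reader.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first; // end of stream, or an ordinary 16-bit char
        }
        int second = reader.read();
        if (second != -1 && Character.isLowSurrogate((char) second)) {
            return Character.toCodePoint((char) first, (char) second);
        }
        if (second != -1) {
            reader.unread(second); // not a low surrogate: leave it for the next call
        }
        return first; // unpaired high surrogate, returned as-is
    }

    // Hypothetical helper: decodes UTF-8 bytes into a list of code points.
    static List<Integer> decodeUtf8(byte[] bytes) {
        List<Integer> codePoints = new ArrayList<>();
        try (PushbackReader reader = new PushbackReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            int codePoint;
            while ((codePoint = readCodePoint(reader)) != -1) {
                codePoints.add(codePoint);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // "a", U+1F600, "b" encoded as UTF-8
        byte[] bytes = "a\uD83D\uDE00b".getBytes(StandardCharsets.UTF_8);
        for (int codePoint : decodeUtf8(bytes)) {
            System.out.println(Integer.toHexString(codePoint)); // 61, 1f600, 62
        }
    }
}
```

If the whole input fits in memory, reading it into a String and iterating with String.codePoints() (Java 8 and later) should give the same sequence with less code.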
