How to read Unicode codepoints greater than 0xFFFF from a file in Java
I'm writing a lexical analyzer for a compiler and I was wondering how I can read a UTF-8 file that contains Unicode codepoints greater than 0xFFFF. The char data type only supports two bytes, so how can I read an int codepoint from the file?
Comments (1)
I had to do this recently; here's the code I used. It's a Spliterator.OfInt implementation that can be used to create an IntStream of codepoints from input from a Reader, or used directly if that's easier. Or just extract the logic from the nextCP method.

Example usage:
or
etc.
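A minimal sketch of the technique described above, and not the answerer's original class: it reads UTF-16 chars from a Reader and combines a high surrogate with the low surrogate that follows it into one int codepoint. The class name CodePointSpliterator and the stream() helper are hypothetical. nextCP is named after the method mentioned above, but its body here is an assumption about how such logic might look.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;
import java.util.stream.StreamSupport;

// Hypothetical sketch: a Spliterator.OfInt that yields codepoints from a Reader.
final class CodePointSpliterator extends Spliterators.AbstractIntSpliterator {
    private final Reader in;
    private int pending = -1; // a char read ahead but not yet consumed, or -1

    CodePointSpliterator(Reader in) {
        super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
        this.in = in;
    }

    // Returns the read-ahead char if there is one, otherwise reads from the Reader.
    private int readChar() throws IOException {
        if (pending >= 0) {
            int c = pending;
            pending = -1;
            return c;
        }
        return in.read();
    }

    // Returns the next codepoint, or -1 at end of input.
    // An unpaired surrogate is returned unchanged (an assumption of this sketch).
    int nextCP() throws IOException {
        int first = readChar();
        if (first < 0 || !Character.isHighSurrogate((char) first)) {
            return first;
        }
        int second = readChar();
        if (second >= 0 && Character.isLowSurrogate((char) second)) {
            return Character.toCodePoint((char) first, (char) second);
        }
        pending = second; // not a low surrogate: keep it for the next call
        return first;
    }

    @Override
    public boolean tryAdvance(IntConsumer action) {
        try {
            int cp = nextCP();
            if (cp < 0) {
                return false;
            }
            action.accept(cp);
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Wraps the spliterator in a sequential IntStream of codepoints.
    static IntStream stream(Reader in) {
        return StreamSupport.intStream(new CodePointSpliterator(in), false);
    }

    public static void main(String[] args) {
        // Demo on an in-memory Reader; the middle character is U+1F600.
        stream(new StringReader("a\uD83D\uDE00b"))
                .forEach(cp -> System.out.println(String.format("U+%04X", cp)));
        // prints U+0061, U+1F600, U+0062 on separate lines
    }
}
```

For a UTF-8 file, the Reader could come from Files.newBufferedReader(path, StandardCharsets.UTF_8), which handles the byte decoding so that only the surrogate pairing is left to do.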
You can also use Files.readString() to read an entire file into a string and use String#codePoints or other codepoint methods on it, but the above class is more memory efficient, if that matters, because it only reads a character at a time. Or read a line at a time and convert those to codepoints.
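The two alternatives mentioned here could look roughly like this. It is a sketch assuming Java 11 or later (for Files.readString); the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical helper class illustrating the whole-file and line-at-a-time variants.
final class ReadAllCodePoints {
    // Reads the whole file into a String, then converts it to codepoints.
    static int[] readCodePoints(Path file) throws IOException {
        return Files.readString(file, StandardCharsets.UTF_8).codePoints().toArray();
    }

    // Reads one line at a time and counts codepoints above 0xFFFF.
    // Files.lines drops line terminators, so they are not part of the stream.
    static long countSupplementaryByLine(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.flatMapToInt(String::codePoints)
                    .filter(Character::isSupplementaryCodePoint)
                    .count();
        }
    }
}
```

The whole-file variant is the simplest but holds the entire text in memory. The line-based variant keeps only one line at a time, at the cost of losing the line terminators, which a lexer may need to reinsert for position tracking.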