How to read Unicode codepoints greater than 0xFFFF from a file in Java

Posted on 2025-01-27 18:24:13

I'm writing a lexical analyzer for a compiler and I was wondering how I can read a UTF-8 file that contains unicode codepoints greater than 0xFFFF. The char data type only supports two bytes, so how can I read an int codepoint from the file?


Comments (1)

柠檬 2025-02-03 18:24:17

I had to do this recently; here's the code I used. It's a Spliterator.OfInt implementation that can be used to create an IntStream of codepoints from input from a Reader, or used directly if that's easier. Or just extract the logic from the nextCP method.

package org.raevnos.util.iterator;

import java.util.Objects;
import java.util.Spliterator;
import java.util.function.IntConsumer;
import java.io.Reader;
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;

/**
 * A {@code Spliterator.OfInt} used to iterate over codepoints read from a file.
 */
public class CPSpliterator
    implements Spliterator.OfInt, Closeable {
    private final Reader input;

    /**
     * Create a new spliterator.
     * @param input The {@code Reader} to get codepoints from.
     */
    public CPSpliterator(Reader input) {
        this.input = Objects.requireNonNull(input);
    }

    /**
     * Fetch the next codepoint from the underlying stream, accounting for
     * surrogate pairs.
     * @return a codepoint, or -1 on end of file.
     * @throws UncheckedIOException on input errors.
     */
    private int nextCP() {
        try {
            int first_char = input.read();
            if (first_char == -1) {
                return -1;
            } else if (Character.isHighSurrogate((char)first_char)) {
                int second_char = input.read();
                if (second_char == -1
                    || !Character.isLowSurrogate((char)second_char)) {
                    // Hopefully shouldn't happen; caught by Reader first.
                    throw new CharacterCodingException();
                } else {
                    return Character.toCodePoint((char)first_char, (char)second_char);
                }
            } else {
                return first_char;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public int characteristics() { return ORDERED | NONNULL; }

    @Override
    public long estimateSize() { return Long.MAX_VALUE; }

    @Override
    public void forEachRemaining(IntConsumer f) {
        int cp;
        while ((cp = nextCP()) != -1) {
            f.accept(cp);
        }
    }

    @Override
    public boolean tryAdvance(IntConsumer f) {
        int cp = nextCP();
        if (cp != -1) {
            f.accept(cp);
            return true;
        } else {
            return false;
        }
    }

    @Override
    public Spliterator.OfInt trySplit() { return null; }

    @Override
    public void close() throws IOException { input.close(); }
}

Example usage:

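// Imports needed: java.nio.file.Files, java.nio.file.Path,
// java.util.stream.IntStream, java.util.stream.StreamSupport
try (CPSpliterator sp = new CPSpliterator(Files.newBufferedReader(Path.of(whereEver)))) {
    IntStream codepoints = StreamSupport.intStream(sp, false);
    // do something with the stream
}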

or

try (CPSpliterator sp = new CPSpliterator(Files.newBufferedReader(Path.of(whereEver)))) {
    sp.forEachRemaining(cp -> doSomething(cp));
}

etc.
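If the full spliterator is more than a lexer needs, the surrogate-pair logic from nextCP can stand on its own. The following is a minimal sketch, not part of the original class: the names CodePointReading, readCodePoint and readAll are made up for illustration, and a StringReader stands in for a file reader so the example runs without any input file.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names are illustrative only.
public class CodePointReading {
    /**
     * Reads one codepoint from the reader, combining a surrogate pair
     * into a single int. Returns -1 at end of input. As in nextCP, a
     * lone low surrogate is passed through unchanged.
     */
    public static int readCodePoint(Reader in) throws IOException {
        int first = in.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first; // end of input, or a character outside the high-surrogate range
        }
        int second = in.read();
        if (second == -1 || !Character.isLowSurrogate((char) second)) {
            throw new IOException("unpaired high surrogate in input");
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    /** Drains the reader into a list of codepoints. */
    public static List<Integer> readAll(Reader in) {
        List<Integer> out = new ArrayList<>();
        try {
            int cp;
            while ((cp = readCodePoint(in)) != -1) {
                out.add(cp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        // "\uD83D\uDE00" is the surrogate pair for U+1F600.
        for (int cp : readAll(new StringReader("a\uD83D\uDE00b"))) {
            System.out.printf("U+%04X%n", cp); // prints U+0061, U+1F600, U+0062
        }
    }
}
```

In a lexer, readCodePoint would typically be called on a BufferedReader opened with Files.newBufferedReader, which decodes UTF-8 by default.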

You can also use Files.readString() to read the entire file into a String and call String#codePoints (or the other codepoint methods) on it, but the class above is more memory efficient, if that matters, because it only reads a character at a time instead of holding the whole file in memory. Alternatively, read a line at a time and convert each line to codepoints.
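A rough sketch of those two alternatives, assuming Java 11 or later for Files.readString. The class and method names (CodePointAlternatives, wholeFile, byLine) are invented for this example and the file path is supplied by the caller:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class; the names are illustrative only.
public class CodePointAlternatives {
    /** Reads the whole file (decoded as UTF-8) and returns its codepoints. */
    public static int[] wholeFile(Path path) throws IOException {
        return Files.readString(path).codePoints().toArray();
    }

    /**
     * Reads the file line by line and returns the codepoints of each line.
     * BufferedReader.lines() strips line terminators, so newlines do not
     * appear in the result; a lexer that tracks them must add them back.
     */
    public static int[] byLine(Path path) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            return reader.lines().flatMapToInt(String::codePoints).toArray();
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Path.of(args[0]); // file to read, given on the command line
        for (int cp : wholeFile(path)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

Both methods collect everything into an array to keep the example short; real code would more likely consume the IntStream directly.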
