
How to read Unicode codepoints greater than 0xFFFF from a file in Java

Posted 2025-01-27 18:24:13

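I'm writing a lexical analyzer for a compiler, and I want to know how to read a UTF-8 file that contains Unicode codepoints greater than 0xFFFF. The char data type only holds two bytes, so how do I read an int codepoint from a file?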


柠檬 2025-02-03 18:24:17

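I had to do this recently; here's the code I used. A Reader hands back UTF-16 chars, so a codepoint above 0xFFFF arrives as a high surrogate followed by a low surrogate, and the two have to be combined into a single int. The class below is a Spliterator.OfInt implementation that can be used to create an IntStream of codepoints from a Reader, or used directly if that's easier. Or just extract the logic from the nextCP method.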

package org.raevnos.util.iterator;

import java.util.Objects;
import java.util.Spliterator;
import java.util.function.IntConsumer;
import java.io.Reader;
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;

/**
 * A {@code Spliterator.OfInt} used to iterate over codepoints read from a file.
 */
public class CPSpliterator
    implements Spliterator.OfInt, Closeable {
    private final Reader input;

    /**
     * Create a new spliterator.
     * @param input The {@code Reader} to get codepoints from.
     */
    public CPSpliterator(Reader input) {
        this.input = Objects.requireNonNull(input);
    }

    /**
     * Fetch the next codepoint from the underlying stream, accounting for
     * surrogate pairs.
     * @return a codepoint, or -1 on end of file.
     * @throws UncheckedIOException on input errors.
     */
    private int nextCP() {
        try {
            int first_char = input.read();
            if (first_char == -1) {
                return -1;
            } else if (Character.isHighSurrogate((char)first_char)) {
                int second_char = input.read();
                if (second_char == -1
                    || !Character.isLowSurrogate((char)second_char)) {
                    // Hopefully shouldn't happen; caught by Reader first.
                    throw new CharacterCodingException();
                } else {
                    return Character.toCodePoint((char)first_char, (char)second_char);
                }
            } else {
                return first_char;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public int characteristics() { return ORDERED | NONNULL; }

    @Override
    public long estimateSize() { return Long.MAX_VALUE; }

    @Override
    public void forEachRemaining(IntConsumer f) {
        int cp;
        while ((cp = nextCP()) != -1) {
            f.accept(cp);
        }
    }

    @Override
    public boolean tryAdvance(IntConsumer f) {
        int cp = nextCP();
        if (cp != -1) {
            f.accept(cp);
            return true;
        } else {
            return false;
        }
    }

    @Override
    public Spliterator.OfInt trySplit() { return null; }

    @Override
    public void close() throws IOException { input.close(); }
}

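Example usage (whereEver is a placeholder for your file path; the snippets need imports for java.nio.file.Files, java.nio.file.Path, java.util.stream.IntStream and java.util.stream.StreamSupport):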

try (CPSpliterator sp = new CPSpliterator(Files.newBufferedReader(Path.of(whereEver)))) {
    IntStream codepoints = StreamSupport.intStream(sp, false);
    // do something with the stream
}

or

try (CPSpliterator sp = new CPSpliterator(Files.newBufferedReader(Path.of(whereEver)))) {
    sp.forEachRemaining(cp -> doSomething(cp));
}

etc.
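
If you only want the logic from nextCP inside a hand-written lexer, it can be pulled out into a small static helper. The following is a minimal sketch, not part of the class above: the names CodePointReaderDemo and nextCodePoint are made up for illustration, and it lets IOException propagate instead of wrapping it in UncheckedIOException.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.CharacterCodingException;

// Hypothetical demo class; the names are illustrative only.
class CodePointReaderDemo {
    /**
     * Reads one codepoint from the reader, combining a surrogate pair into a
     * single int. Returns -1 at end of input.
     */
    static int nextCodePoint(Reader in) throws IOException {
        int first = in.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            // End of input, or a char that does not start a surrogate pair.
            return first;
        }
        int second = in.read();
        if (second == -1 || !Character.isLowSurrogate((char) second)) {
            // High surrogate without a matching low surrogate.
            throw new CharacterCodingException();
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    public static void main(String[] args) throws IOException {
        // "a", then U+1F600 written as a surrogate pair, then "b".
        try (Reader in = new StringReader("a\uD83D\uDE00b")) {
            int cp;
            while ((cp = nextCodePoint(in)) != -1) {
                System.out.printf("U+%04X%n", cp);
            }
        }
    }
}
```

For a real file, pass in the Reader from Files.newBufferedReader instead of the StringReader used here for the demo.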

You can also use Files.readString() to read an entire file into a string and use String#codePoints or other codepoint methods on it, but the above class is more memory efficient, if that matters, because it only reads a character at a time. Another option is to read a line at a time and convert each line to codepoints.
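
Here is a rough sketch of those two alternatives, assuming Java 11 or later (for Files.readString and Files.writeString) and a UTF-8 file. The class name WholeFileCodePoints and the methods readAll and countByLine are hypothetical, chosen only for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; the names are illustrative only.
class WholeFileCodePoints {
    /** Reads the whole file as UTF-8 and returns all of its codepoints. */
    static int[] readAll(Path file) throws IOException {
        return Files.readString(file, StandardCharsets.UTF_8).codePoints().toArray();
    }

    /** Counts codepoints one line at a time; readLine() drops line terminators. */
    static long countByLine(Path file) throws IOException {
        long total = 0;
        try (BufferedReader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                total += line.codePoints().count();
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("codepoints", ".txt");
        try {
            // "x", U+1F600 as a surrogate pair, a newline, then "y".
            Files.writeString(tmp, "x\uD83D\uDE00\ny", StandardCharsets.UTF_8);
            for (int cp : readAll(tmp)) {
                System.out.printf("U+%04X%n", cp);
            }
            System.out.println(countByLine(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Note that the line-by-line version never sees the newline characters, so a lexer that tracks line breaks as tokens would have to account for them separately.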
