How to decode a MappedByteBuffer in Java without copying it

Posted 2025-02-12 10:48:58

How can I decode the UTF-8 code points of a MappedByteBuffer in Java without copying the buffer? Copying the buffer onto the heap would defeat the purpose of mapping the file into memory, and as far as I can tell, using CharsetDecoder would require exactly such a copy (via a toCharArray-style method). Is there any way to decode it efficiently without copying the buffer?

Comments (2)

听风念你 2025-02-19 10:48:58

Here is a somewhat more complete answer.

It reads the input file in chunks and stores the decoded result in a CharBuffer.

import static java.nio.charset.StandardCharsets.UTF_8;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.MalformedInputException;
import java.nio.charset.UnmappableCharacterException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChunkedUtf8Decoder {
    /*
     * Valid UTF-8 Multi-Bytes
     * -----------------------
     * 1-Byte : 0xxxxxxx
     * 2-Byte : 110xxxxx + 10xxxxxx
     * 3-Byte : 1110xxxx + 10xxxxxx + 10xxxxxx
     * 4-Byte : 11110xxx + 10xxxxxx + 10xxxxxx + 10xxxxxx
     */
    private static final int            UTF8_CONTINUE_MASK   = 0b11_000000;
    private static final int            UTF8_INVALID_MASK    = 0b11111_000;

    private static final int            UTF8_CONTINUE_PREFIX = 0b10_000000;
    private static final int            UTF8_INVALID_PREFIX  = UTF8_INVALID_MASK;

    private static final int            CHUNK_SIZE           = 24; // TODO Test value! Try something larger, e.g. 64_000

    private        final CharsetDecoder utf8Decoder          = UTF_8.newDecoder();
    private        final ByteBuffer     bb                   = ByteBuffer.allocate(CHUNK_SIZE); // TODO maybe allocateDirect?
    private        final CharBuffer     cb;
    private        final Path           inputPath;

    private              byte[]         remainder            = {};
    private              int            bytesChunked         = 0;

    public ChunkedUtf8Decoder(final Path inputPath) throws IOException {

        this.inputPath = inputPath;
        this.cb        = CharBuffer.allocate(Math.toIntExact(Files.size(inputPath)));

//      this.utf8Decoder.onMalformedInput     (CodingErrorAction.REPLACE); // (TODO Default is REPORT)
//      this.utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE); // (TODO Default is REPORT)
    }

    /**
     * Split the Input-File into Chunks & Decode them, appending the result to our CharBuffer.
     * 
     * @throws IOException
     */
    public void decode() throws IOException {

        try(final FileChannel  channel = FileChannel.open(inputPath))
        {
            while (channel.read(bb) != -1) {

                debugBytesIn("Read......:", false);

                this.remainder    =  backupToCharBoundary();

                debugBytesIn("Backed up.:", true);

                this.bytesChunked += decodeChunk();

                bb.clear();             // (position=0, limit=capacity, mark=-1)
                bb.put(this.remainder); // (Final Remainder is dealt with below)
            }
            debugBytesIn("EOF.......:", false);
            /*
             * Lastly, deal with Final Remainder (0 to 4 Bytes) @ start of Buffer...
             */
            decodeChunk();
        }
    }

    /**
     * We try to back up the BB to a Character boundary.
     * If the file is correctly encoded, the shorter BB will then decode OK.
     * The Remainder will be processed after the NEXT Read operation.
     * 
     * @return
     * @throws MalformedInputException
     */
    private byte[] backupToCharBoundary() throws MalformedInputException {

        for (int i = 1; i <= 4; i++) {

            final int trailingBytePos = bb.position() - i;

            if (trailingBytePos < 0) {
                /*
                 *  If there were too few Bytes available, carry them over until either
                 *  a) more Bytes become available, or
                 *  b) EOF is reached
                 */
                final byte[] remainder = new byte[bb.position()];

                bb.get  (0,  remainder);
                bb.clear();       // (position=0, limit=capacity, mark=-1)

                return remainder; // (Entire contents of BB in Remainder, BB Empty)
            }

            final int trailingByte = 0xFF & bb.get(trailingBytePos);

            /*
             * We stop as soon as we encounter a Character-Selector...
             * (The following 2 intervals are either invalid or a continuation-character)
             */
            if ((trailingByte  &  UTF8_INVALID_MASK )  !=  UTF8_INVALID_PREFIX     // 11111xxx
            &&  (trailingByte  &  UTF8_CONTINUE_MASK)  !=  UTF8_CONTINUE_PREFIX) { // 10xxxxxx
                /*
                 *  OK, we have found a (1-, 2-, 3- or 4-Byte) Character-Selector...
                 */
                final byte[] remainder = new byte[i];

                bb.get     (trailingBytePos, remainder);
                bb.position(trailingBytePos);

                return remainder; // (Trailing 1-4 Bytes of BB in Remainder)
            }
        }
        /*
         * Although at least 4 Bytes are available, We couldn't find a Character-Selector!
         * This is an error.
         * We leave the ByteBuffer unchanged & return an empty Remainder.
         * The CharsetDecoder will produce a MalformedInputException in due course...
         */
        return new byte[0];
    }

    /**
     * Decode the current Chunk of Bytes & append the result to our CB.
     * 
     * @return
     * @throws CharacterCodingException
     */
    private int decodeChunk() throws CharacterCodingException {
        try {
            this.bb.flip(); // (limit=position, position=0, mark=-1)

            ;                       this.utf8Decoder.reset();
            final CoderResult res = this.utf8Decoder.decode(this.bb, this.cb, true);

            System.out.println("CB Chars..: " + this.cb.position());
            /*
             * UNDERFLOW is the normal result of an error-free decode:
             * it means the decoder has consumed all the input it was given.
             */
            if (!res.isUnderflow()) {
                res.throwException();
            }
            return this.bb.position();
        }
        catch (final  MalformedInputException e) {
            throw new MalformedInputException     (this.bytesChunked + e.getInputLength());
        }
        catch (final  UnmappableCharacterException e) {
            throw new UnmappableCharacterException(this.bytesChunked + e.getInputLength());
            /*
             * (Famous Last Words "UnmappableCharacterException never occurs with UTF-8 Decoder")
             */
        }
        catch (final CharacterCodingException inputUnderflowOrOutputOverflow) {
            throw                             inputUnderflowOrOutputOverflow;
            /*
             * (Underflow if the final Character of the final Chunk was incomplete)
             * (Overflow  if the Output Buffer is too small, which SHOULD NOT HAPPEN with our big CB)
             */
        }
    }

    /**
     * Some Debug Writes.
     * 
     * @param debugText
     * @param includeRemainder
     */
    private void debugBytesIn(final String debugText, final boolean includeRemainder) {

        System        .out.print(debugText + " previous=" + this.bytesChunked + " bbPos=" + bb.position() + " Bytes.:");

        for (    int p = 0; p < bb.position();         p++) {
            System    .out.print(" " + Integer.toHexString(0xFF & bb.get(p)));
        }
        if (includeRemainder) {
            System    .out.print(" Remainder.:");

            for (int r = 0; r < this.remainder.length; r++) {
                System.out.print(" " + Integer.toHexString(0xFF & this.remainder[r]));
            }
        }
        System        .out.println();
    }

    public static void main(final String[] args) throws IOException {

        final ChunkedUtf8Decoder utf8Decoder = new ChunkedUtf8Decoder(Path.of("Utf8-Chars_nn.txt"));
        ;                        utf8Decoder.decode();

        System.out.println      (utf8Decoder.cb.flip()); // TODO Note.: flip()
    }
}
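
As a shorter variant of the chunked approach above, the back-up to a character boundary can be left to the decoder itself. This is a minimal sketch, not part of the original answer: when `decode` is called with `endOfInput` set to `false`, the decoder leaves the bytes of an incomplete trailing sequence unconsumed, and `ByteBuffer.compact()` moves them to the front of the buffer for the next read. The class name `CompactingUtf8Decoder` and the `chunkSize` parameter are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CompactingUtf8Decoder {

    /** Decodes a UTF-8 file chunk by chunk; chunkSize must be at least 4 bytes. */
    public static String decode(Path path, int chunkSize) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder(); // default action: REPORT
        ByteBuffer bytes = ByteBuffer.allocate(chunkSize);
        CharBuffer chars = CharBuffer.allocate(chunkSize); // UTF-8 never yields more chars than bytes
        StringBuilder out = new StringBuilder();

        try (FileChannel channel = FileChannel.open(path)) {
            boolean eof = false;
            CoderResult res;
            do {
                if (!eof) {
                    eof = channel.read(bytes) == -1;
                }
                bytes.flip();
                res = decoder.decode(bytes, chars, eof);
                if (res.isError()) {
                    res.throwException(); // malformed input
                }
                drain(chars, out);
                bytes.compact(); // carry over the bytes of an incomplete trailing character
            } while (!eof || res.isOverflow());

            do {
                res = decoder.flush(chars);
                drain(chars, out);
            } while (res.isOverflow());
        }
        return out.toString();
    }

    /** Appends the decoded chars to the output and empties the CharBuffer for reuse. */
    private static void drain(CharBuffer chars, StringBuilder out) {
        chars.flip();
        out.append(chars);
        chars.clear();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf8-demo", ".txt");
        Files.writeString(file, "a\u20ac\uD83D\uDE00");
        // A 5-byte chunk splits the 4-byte emoji across two reads.
        System.out.println(decode(file, 5).equals("a\u20ac\uD83D\uDE00")); // prints true
        Files.delete(file);
    }
}
```

Collecting the output in a StringBuilder is only for demonstration; the drained chars could just as well be handed to a consumer so that the whole text never sits on the heap at once.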
枉心 2025-02-19 10:48:58

Not really: the decoded chars have to be stored somewhere, so essentially a char[] (or CharBuffer) needs to be built from the bytes backing the MappedByteBuffer.

So something like the following is necessary:

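import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

private static CharBuffer readUTF8(final Path path) throws IOException {

    // Closing the channel does not invalidate an established mapping.
    try (FileChannel channel = FileChannel.open(path)) {
        final long             byteCount  = channel.size();
        final MappedByteBuffer byteBuffer = channel.map(MapMode.READ_ONLY, 0, byteCount);

        return StandardCharsets.UTF_8.decode(byteBuffer); // allocates a new heap CharBuffer
    }
}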
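
Once a CharBuffer has been produced this way, the code points the question asks about can be read from it directly, because CharBuffer implements CharSequence. The following is a small sketch under that assumption; the class name `CodePointDemo` is hypothetical, and a wrapped byte array stands in for the mapped buffer. Note that `Charset.decode` replaces malformed input with the replacement character instead of throwing.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

public class CodePointDemo {

    /** Decodes the remaining bytes of the buffer as UTF-8 and returns the code points. */
    public static int[] codePoints(ByteBuffer utf8) {
        CharBuffer chars = StandardCharsets.UTF_8.decode(utf8);
        return chars.codePoints().toArray(); // surrogate pairs are combined into one code point
    }

    public static void main(String[] args) {
        ByteBuffer utf8 = ByteBuffer.wrap("a\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8));
        for (int cp : codePoints(utf8)) {
            System.out.printf("U+%04X%n", cp); // prints U+0061, U+20AC, U+1F600
        }
    }
}
```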

The following snippet shows how the components behind the above example (a CharsetDecoder and a CharBuffer) can be pre-allocated and reused.
As the Javadoc states, this is a complex matter, so treat it only as a collection of ideas and JDK method invocations that may or may not be of use, depending on your concrete requirements.
You really need an in-depth understanding of Charsets etc. before using the following...

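import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

/**
 * It is possible to pre-allocate a CharsetDecoder & CharBuffer & use those multiple times.
 * Be aware that this would make your logic MUCH more complicated.
 * a) you'll have to control use/reuse & initialisation of the CharsetDecoder
 * b) you'll need to ensure the CharBuffer has sufficient capacity
 * 
 * The following is just snippets which may be of use.
 */
public static void main(final String[] args) throws Exception {

    final CharsetDecoder   utf8Decoder = StandardCharsets.UTF_8.newDecoder();
    ;                      utf8Decoder.onMalformedInput     (CodingErrorAction.REPLACE);
    ;                      utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);

    try (FileChannel channel = FileChannel.open(Path.of("my File"))) {

        // A single mapping holds at most Integer.MAX_VALUE bytes; toIntExact throws for larger files.
        final int              byteBufferLength = Math.toIntExact(channel.size());
        final MappedByteBuffer byteBuffer       = channel.map(MapMode.READ_ONLY, 0, byteBufferLength);

        // UTF-8 never decodes to more chars than there are input bytes, so this capacity suffices.
        final int              charBufLength    = byteBufferLength;
        final CharBuffer       charBuf          = CharBuffer.allocate(charBufLength);

        readUTF8(utf8Decoder, byteBuffer, charBuf);
    }
}
private static void readUTF8(final CharsetDecoder utf8Decoder, final MappedByteBuffer byteBuffer, final CharBuffer charBuffer) throws IOException {

    // The one-argument utf8Decoder.decode(byteBuffer) allocates a new CharBuffer on every call,
    // so the three-argument form is used instead: it reuses utf8Decoder & charBuffer.
    utf8Decoder.reset(); // required before each new decoding operation
    charBuffer .clear(); // discard the result of any previous use

    CoderResult res = utf8Decoder.decode(byteBuffer, charBuffer, true);
    if (res.isUnderflow()) {
        res = utf8Decoder.flush(charBuffer);
    }
    if (!res.isUnderflow()) {
        res.throwException(); // overflow or coding error
    }
    charBuffer.flip();
}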
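
The reuse idea can be taken one step further if the text only needs to be processed once: instead of a CharBuffer as large as the file, a small CharBuffer can serve as a sliding window over the mapped bytes. The sketch below is an illustration under stated assumptions, not the answer's method: the class name `MappedUtf8Stream`, the method `countCodePoints` and the `windowChars` parameter are hypothetical, and the file is assumed to fit into a single mapping.

```java
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MappedUtf8Stream {

    /** Counts the code points of a UTF-8 file through a reusable window of windowChars (>= 2) chars. */
    public static long countCodePoints(Path path, int windowChars) throws IOException {
        try (FileChannel channel = FileChannel.open(path)) {
            // Assumption: the file is at most Integer.MAX_VALUE bytes long.
            MappedByteBuffer in = channel.map(MapMode.READ_ONLY, 0, channel.size());
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder(); // default action: REPORT
            CharBuffer window = CharBuffer.allocate(windowChars);
            long count = 0;
            boolean flushing = false;
            while (true) {
                CoderResult res = flushing ? decoder.flush(window)
                                           : decoder.decode(in, window, true);
                if (res.isError()) {
                    res.throwException(); // malformed input
                }
                window.flip();
                while (window.hasRemaining()) {
                    // Count every char except the second half of a surrogate pair.
                    if (!Character.isLowSurrogate(window.get())) {
                        count++;
                    }
                }
                window.clear();
                if (res.isUnderflow()) {
                    if (flushing) {
                        return count;
                    }
                    flushing = true; // all input consumed; drain the decoder
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("utf8-demo", ".txt");
        file.toFile().deleteOnExit(); // a mapped file may not be deletable right away on some platforms
        Files.writeString(file, "h\u00e9llo \u20ac\uD83D\uDE00");
        System.out.println(countCodePoints(file, 4)); // prints 8
    }
}
```

Only `windowChars` chars live on the heap at any time, while the operating system pages the mapped bytes in as the decoder advances. The decoded chars are still copies of the file content, so this limits the size of the copy rather than avoiding it.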