How to decode a MappedByteBuffer in Java without copying it

Posted on 2025-02-12 10:48:58

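How can I decode the UTF-8 code points of a MappedByteBuffer in Java without copying the buffer? Copying the buffer onto the heap defeats the point of mapping the file, and as far as I can tell CharsetDecoder would require me to copy the buffer into a char array first. Is there an efficient way to decode it without copying the buffer?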

听风念你 2025-02-19 10:48:58

Here is a somewhat more complete answer.

It reads the Input-File in Chunks & stores the result in a CharBuffer.

import static java.nio.charset.StandardCharsets.UTF_8;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.MalformedInputException;
import java.nio.charset.UnmappableCharacterException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChunkedUtf8Decoder {
    /*
     * Valid UTF-8 Multi-Bytes
     * -----------------------
     * 1-Byte : 0xxxxxxx
     * 2-Byte : 110xxxxx + 10xxxxxx
     * 3-Byte : 1110xxxx + 10xxxxxx + 10xxxxxx
     * 4-Byte : 11110xxx + 10xxxxxx + 10xxxxxx + 10xxxxxx
     */
    private static final int            UTF8_CONTINUE_MASK   = 0b11_000000;
    private static final int            UTF8_INVALID_MASK    = 0b11111_000;

    private static final int            UTF8_CONTINUE_PREFIX = 0b10_000000;
    private static final int            UTF8_INVALID_PREFIX  = UTF8_INVALID_MASK;

    private static final int            CHUNK_SIZE           = 24; // TODO Test value! Try something larger, e.g. 64_000

    private        final CharsetDecoder utf8Decoder          = UTF_8.newDecoder();
    private        final ByteBuffer     bb                   = ByteBuffer.allocate(CHUNK_SIZE); // TODO maybe allocateDirect?
    private        final CharBuffer     cb;
    private        final Path           inputPath;

    private              byte[]         remainder            = {};
    private              int            bytesChunked         = 0;

    public ChunkedUtf8Decoder(final Path inputPath) throws IOException {

        this.inputPath = inputPath;
        this.cb        = CharBuffer.allocate(Math.toIntExact(Files.size(inputPath)));

//      this.utf8Decoder.onMalformedInput     (CodingErrorAction.REPLACE); // (TODO Default is REPORT)
//      this.utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE); // (TODO Default is REPORT)
    }

    /**
     * Split the Input-File into Chunks & Decode them, appending the result to our CharBuffer.
     * 
     * @throws IOException
     */
    public void decode() throws IOException {

        try(final FileChannel  channel = FileChannel.open(inputPath))
        {
            while (channel.read(bb) != -1) {

                debugBytesIn("Read......:", false);

                this.remainder    =  backupToCharBoundary();

                debugBytesIn("Backed up.:", true);

                this.bytesChunked += decodeChunk();

                bb.clear();             // (position=0, limit=capacity, mark=-1)
                bb.put(this.remainder); // (Final Remainder is dealt with below)
            }
            debugBytesIn("EOF.......:", false);
            /*
             * Lastly, deal with Final Remainder (0 to 4 Bytes) @ start of Buffer...
             */
            decodeChunk();
        }
    }

    /**
     * We try to back up the BB to a Character boundary.
     * If the file is correctly encoded, the shorter BB will then decode OK.
     * The Remainder will be processed after the NEXT Read operation.
     * 
     * @return
     * @throws MalformedInputException
     */
    private byte[] backupToCharBoundary() throws MalformedInputException {

        for (int i = 1; i <= 4; i++) {

            final int trailingBytePos = bb.position() - i;

            if (trailingBytePos < 0) {
                /*
                 *  If there were too few Bytes available, carry them over until either
                 *  a) more Bytes become available, or
                 *  b) EOF is reached
                 */
                final byte[] remainder = new byte[bb.position()];

                bb.get  (0,  remainder);
                bb.clear();       // (position=0, limit=capacity, mark=-1)

                return remainder; // (Entire contents of BB in Remainder, BB Empty)
            }

            final int trailingByte = 0xFF & bb.get(trailingBytePos);

            /*
             * We stop as soon as we encounter a Character-Selector...
             * (The following 2 intervals are either invalid or a continuation-character)
             */
            if ((trailingByte  &  UTF8_INVALID_MASK )  !=  UTF8_INVALID_PREFIX     // 11111xxx
            &&  (trailingByte  &  UTF8_CONTINUE_MASK)  !=  UTF8_CONTINUE_PREFIX) { // 10xxxxxx
                /*
                 *  OK, we have found a (1-, 2-, 3- or 4-Byte) Character-Selector...
                 */
                final byte[] remainder = new byte[i];

                bb.get     (trailingBytePos, remainder);
                bb.position(trailingBytePos);

                return remainder; // (Trailing 1-4 Bytes of BB in Remainder)
            }
        }
        /*
         * Although at least 4 Bytes are available, We couldn't find a Character-Selector!
         * This is an error.
         * We leave the ByteBuffer unchanged & return an empty Remainder.
         * The CharsetDecoder will produce a MalformedInputException in due course...
         */
        return new byte[0];
    }

    /**
     * Decode the current Chunk of Bytes & append the result to our CB.
     * 
     * @return
     * @throws CharacterCodingException
     */
    private int decodeChunk() throws CharacterCodingException {
        try {
            this.bb.flip(); // (limit=position, position=0, mark=-1)

            ;                       this.utf8Decoder.reset();
            final CoderResult res = this.utf8Decoder.decode(this.bb, this.cb, true);

            System.out.println("CB Chars..: " + this.cb.position());
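            /*
             * UNDERFLOW is the normal result of an error-free Decode: it means all Input was consumed.
             * (OVERFLOW would mean the CharBuffer is full; anything else is a Malformed/Unmappable error)
             */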
            if (res.isUnderflow() == false) {
                res.throwException();
            }
            return this.bb.position();
        }
        catch (final  MalformedInputException e) {
            throw new MalformedInputException     (this.bytesChunked + e.getInputLength());
        }
        catch (final  UnmappableCharacterException e) {
            throw new UnmappableCharacterException(this.bytesChunked + e.getInputLength());
            /*
             * (Famous Last Words "UnmappableCharacterException never occurs with UTF-8 Decoder")
             */
        }
        catch (final CharacterCodingException inputUnderflowOrOutputOverflow) {
            throw                             inputUnderflowOrOutputOverflow;
            /*
             * (Underflow if the final Character of the final Chunk was incomplete)
             * (Overflow  if the Output Buffer is too small, which SHOULD NOT HAPPEN with our big CB)
             */
        }
    }

    /**
     * Some Debug Writes.
     * 
     * @param debugText
     * @param includeRemainder
     */
    private void debugBytesIn(final String debugText, final boolean includeRemainder) {

        System        .out.print(debugText + " previous=" + this.bytesChunked + " bbPos=" + bb.position() + " Bytes.:");

        for (    int p = 0; p < bb.position();         p++) {
            System    .out.print(" " + Integer.toHexString(0xFF & bb.get(p)));
        }
        if (includeRemainder) {
            System    .out.print(" Remainder.:");

            for (int r = 0; r < this.remainder.length; r++) {
                System.out.print(" " + Integer.toHexString(0xFF & this.remainder[r]));
            }
        }
        System        .out.println();
    }

    public static void main(final String[] args) throws IOException {

        final ChunkedUtf8Decoder utf8Decoder = new ChunkedUtf8Decoder(Path.of("Utf8-Chars_nn.txt"));
        ;                        utf8Decoder.decode();

        System.out.println      (utf8Decoder.cb.flip()); // TODO Note.: flip()
    }
}
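
As an alternative to backing up to a character boundary by hand, the CharsetDecoder can be left to do that work itself. The following is only a minimal sketch, not the method used above: it assumes that passing endOfInput=false makes the decoder stop in front of an incomplete trailing sequence (which is what the CharsetDecoder Javadoc describes) and that compact() then carries those bytes over to the next read. The class name StreamingUtf8Decode, the method decodeAll and the StringBuilder sink are hypothetical and exist only for illustration; a real consumer would process each batch of chars instead of collecting them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

public class StreamingUtf8Decode {

    /** Decodes everything readable from the channel, chunkSize bytes at a time. */
    static String decodeAll(ReadableByteChannel channel, int chunkSize) throws IOException {
        if (chunkSize < 4) {
            throw new IllegalArgumentException("chunkSize must hold one 4-byte UTF-8 sequence");
        }
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder(); // default action: REPORT
        ByteBuffer bytes = ByteBuffer.allocate(chunkSize);
        CharBuffer chars = CharBuffer.allocate(chunkSize); // UTF-8 never yields more chars than bytes
        StringBuilder out = new StringBuilder();           // demonstration sink only

        boolean eof = false;
        while (!eof) {
            eof = channel.read(bytes) == -1;
            bytes.flip();
            // With endOfInput=false an incomplete trailing sequence is left unread in 'bytes'.
            CoderResult result = decoder.decode(bytes, chars, eof);
            if (result.isError()) {
                result.throwException();
            }
            drain(chars, out);
            bytes.compact(); // moves the unread tail (0 to 3 bytes) to the start of the buffer
        }
        decoder.flush(chars);
        drain(chars, out);
        return out.toString();
    }

    /** Appends the decoded chars to the sink and empties the CharBuffer for reuse. */
    private static void drain(CharBuffer chars, StringBuilder out) {
        chars.flip();
        out.append(chars);
        chars.clear();
    }

    public static void main(String[] args) throws IOException {
        String text = "h\u00e9llo \u20ac \uD83D\uDE00 end"; // 2-, 3- and 4-byte sequences
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String decoded = decodeAll(Channels.newChannel(new ByteArrayInputStream(utf8)), 5);
        System.out.println(decoded.equals(text));
    }
}
```

The deliberately tiny chunk size of 5 forces multi-byte sequences to straddle chunk boundaries, which is the case the manual back-up logic above handles explicitly.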


枉心 2025-02-19 10:48:58

Not really: essentially a char[] needs to be built from the bytes backing the MappedByteBuffer (which is always a direct buffer), so some destination for the decoded chars is unavoidable.

So something like the following is necessary:

private static CharBuffer readUTF8(final Path path) throws IOException {

    // A mapping stays valid after its channel is closed, so the channel can be closed here.
    try (FileChannel channel = FileChannel.open(path)) {

        final long             byteCount  = channel.size(); // map() accepts at most Integer.MAX_VALUE bytes
        final MappedByteBuffer byteBuffer = channel.map(MapMode.READ_ONLY, 0, byteCount);

        return StandardCharsets.UTF_8.decode(byteBuffer);
    }
}
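
If the decoded text does not have to exist all at once, the heap footprint can be kept to a small, reusable CharBuffer while the mapped bytes are decoded in place. The sketch below only illustrates that idea under stated assumptions: the class MappedUtf8Scan, the method countCodePoints and the parameter windowChars are hypothetical names, and counting code points merely stands in for whatever per-character work is actually needed.

```java
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.FileChannel.MapMode;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class MappedUtf8Scan {

    /** Counts the code points of a UTF-8 file; only windowChars chars are ever held on the heap. */
    static long countCodePoints(Path path, int windowChars) throws IOException {
        if (windowChars < 2) {
            throw new IllegalArgumentException("window must hold a surrogate pair (2 chars)");
        }
        try (FileChannel channel = FileChannel.open(path)) {
            // Files larger than Integer.MAX_VALUE bytes would have to be mapped in several regions.
            MappedByteBuffer mapped = channel.map(MapMode.READ_ONLY, 0, channel.size());
            CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
            CharBuffer window = CharBuffer.allocate(windowChars);
            long count = 0;

            CoderResult result;
            do {
                // The whole input is already present, so endOfInput is true on every call.
                result = decoder.decode(mapped, window, true);
                if (result.isError()) {
                    result.throwException();
                }
                count += drain(window);
            } while (result.isOverflow()); // OVERFLOW: window was full, more input remains

            decoder.flush(window);
            count += drain(window);
            return count;
        }
    }

    /** Consumes the window; a surrogate pair counts once because low surrogates are skipped. */
    private static long drain(CharBuffer window) {
        window.flip();
        long codePoints = 0;
        while (window.hasRemaining()) {
            if (!Character.isLowSurrogate(window.get())) {
                codePoints++;
            }
        }
        window.clear();
        return codePoints;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("utf8-sample", ".txt");
        Files.write(tmp, "a\u00e9\u20ac\uD83D\uDE00".getBytes(StandardCharsets.UTF_8));
        System.out.println(countCodePoints(tmp, 2));
    }
}
```

The bytes are still read and the chars are still produced, so this does not avoid decoding work; it only avoids allocating a CharBuffer as large as the file.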

The following snippet shows how the components used in the example above could be reused.
As the Javadoc states, this is a complex matter.
Treat it only as a collection of ideas and invocations of various JDK methods, which may or may not be useful depending on your concrete requirements.
You really need an in-depth understanding of Charsets etc. before using it.

/**
 * It is possible to pre-allocate a CharsetDecoder & CharBuffer & use those multiple times.
 * Be aware that this would make your logic MUCH more complicated.
 * a) you'll have to control use/reuse & initialisation of the CharsetDecoder
 * b) you'll need to ensure the CharBuffer has sufficient capacity
 * 
 * The following is just snippets which may be of use.
 */
public static void main(final String[] args) throws Exception {

    final CharsetDecoder   utf8Decoder = StandardCharsets.UTF_8.newDecoder();
    ;                      utf8Decoder.onMalformedInput     (CodingErrorAction.REPLACE);
    ;                      utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);

    final int              charBufLength    = 0; // TODO how long?
    final CharBuffer       charBuf          = CharBuffer.allocate(charBufLength);

    final int              byteBufferLength = 0; // TODO how long?
    final MappedByteBuffer byteBuffer       = FileChannel.open(Path.of("my File")).map(MapMode.READ_ONLY, 0, byteBufferLength);

    readUTF8(utf8Decoder, byteBuffer, charBuf);
}
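private static void readUTF8(final CharsetDecoder utf8Decoder, final MappedByteBuffer byteBuffer, final CharBuffer charBuffer) throws IOException {

    utf8Decoder.reset(); // needed before every new decoding operation when the decoder is reused

    // Do NOT also call utf8Decoder.decode(byteBuffer) here: that convenience method allocates its own
    // CharBuffer, consumes byteBuffer & leaves the decoder flushed, so the call below would then fail.
    final CoderResult result = utf8Decoder.decode(byteBuffer, charBuffer, true); // reuses utf8Decoder & charBuffer

    if (result.isUnderflow() == false) {
        result.throwException(); // e.g. OVERFLOW if charBuffer is too small
    }
    utf8Decoder.flush(charBuffer);
    charBuffer.flip();
}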
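
For the "how long?" question on the CharBuffer, one conservative option is to size it from the decoder's own worst-case ratio. This is only a sketch of that idea: the class CharBufferSizing and the method worstCaseChars are hypothetical names, and the result is an upper bound rather than the exact number of chars the input will produce.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.StandardCharsets;

public class CharBufferSizing {

    /** Upper bound on the chars a decoder can produce from byteCount input bytes. */
    static int worstCaseChars(CharsetDecoder decoder, long byteCount) {
        // maxCharsPerByte() is the decoder's documented worst-case output ratio.
        double chars = Math.ceil(byteCount * (double) decoder.maxCharsPerByte());
        if (chars > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("input too large for a single CharBuffer");
        }
        return (int) chars;
    }

    public static void main(String[] args) {
        CharsetDecoder utf8Decoder = StandardCharsets.UTF_8.newDecoder();
        CharBuffer charBuf = CharBuffer.allocate(worstCaseChars(utf8Decoder, 1_000));
        System.out.println(charBuf.capacity());
    }
}
```

A buffer sized this way cannot overflow during a single decode of that input, at the cost of over-allocating whenever the text contains multi-byte sequences.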
