Java: memory-efficient ByteArrayOutputStream

Posted on 2024-12-02 06:50:44

I've got a 40MB file on disk and I need to "map" it into memory using a byte array.

At first, I thought writing the file to a ByteArrayOutputStream would be the best way, but I found it takes about 160MB of heap space at some point during the copy operation.

Does somebody know a better way to do this without using three times the file size in RAM?

Update: Thanks for your answers. I noticed I could reduce memory consumption a little by giving ByteArrayOutputStream an initial size a bit greater than the original file size (using the exact size with my code forces a reallocation; I still have to check why).

There's another high-memory spot: when I get the byte[] back with ByteArrayOutputStream.toByteArray. Looking at its source code, I can see it is cloning the array:

public synchronized byte toByteArray()[] {
    return Arrays.copyOf(buf, count);
}

I'm thinking I could just extend ByteArrayOutputStream and rewrite this method so as to return the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?

9 Comments

拍不死你 2024-12-09 06:50:44

MappedByteBuffer might be what you're looking for.

I'm surprised it takes so much RAM to read a file into memory, though. Have you constructed the ByteArrayOutputStream with an appropriate capacity? If you haven't, the stream could allocate a new byte array when it's near the end of the 40 MB, meaning that you would, for example, have a full buffer of 39MB and a new buffer of twice that size. Whereas if the stream has the appropriate capacity, there won't be any reallocation (faster), and no wasted memory.
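
A minimal sketch of the mapping approach, assuming a file at the illustrative path "somefile":

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Map the file read-only; the OS pages it in on demand instead of
// copying the whole file onto the Java heap.
try (RandomAccessFile raf = new RandomAccessFile("somefile", "r");
     FileChannel channel = raf.getChannel()) {
    MappedByteBuffer mapped =
            channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    byte first = mapped.get(0); // random access without a heap-sized copy
}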

卸妝后依然美 2024-12-09 06:50:44

ByteArrayOutputStream should be okay so long as you specify an appropriate size in the constructor. It will still create a copy when you call toByteArray, but that's only temporary. Do you really mind the memory briefly going up a lot?

Alternatively, if you already know the size to start with you can just create a byte array and repeatedly read from a FileInputStream into that buffer until you've got all the data.
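
A sketch of that loop, assuming the file fits in an int-indexed array (the path is illustrative):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

File file = new File("somefile");
byte[] data = new byte[(int) file.length()];
try (FileInputStream in = new FileInputStream(file)) {
    int pos = 0;
    while (pos < data.length) {
        int n = in.read(data, pos, data.length - pos);
        if (n == -1) {
            throw new IOException("unexpected end of file");
        }
        pos += n; // keep reading until the buffer is full
    }
}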

忆沫 2024-12-09 06:50:44

If you really want to map the file into memory, then a FileChannel is the appropriate mechanism.

If all you want to do is read the file into a simple byte[] (and don't need changes to that array to be reflected back to the file), then simply reading into an appropriately-sized byte[] from a normal FileInputStream should suffice.

Guava has Files.toByteArray() which does all that for you.
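
For reference, the Guava call is a one-liner (assuming Guava is on the classpath; the path is illustrative):

import com.google.common.io.Files;
import java.io.File;

// Allocates one correctly-sized array and fills it; no intermediate stream.
byte[] bytes = Files.toByteArray(new File("somefile"));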

深巷少女 2024-12-09 06:50:44

For an explanation of the buffer growth behavior of ByteArrayOutputStream, please read this answer.

In answer to your question, it is safe to extend ByteArrayOutputStream. In your situation, it is probably better to override the write methods so that the maximum additional allocation is limited, say, to 16MB. You should not override toByteArray to expose the protected buf[] member. This is because a stream is not just a buffer; it is a buffer with a position pointer and boundary protection. So, it is dangerous to access and potentially manipulate the buffer from outside the class.
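
A sketch of that idea, assuming a hypothetical 16 MB cap (the class and constant names are illustrative; argument checks are omitted):

import java.io.ByteArrayOutputStream;
import java.util.Arrays;

/** Sketch: the buffer doubles as usual, but grows by at most 16 MB per step. */
public class CappedGrowthOutputStream extends ByteArrayOutputStream {
    private static final int MAX_STEP = 16 * 1024 * 1024; // hypothetical cap

    public CappedGrowthOutputStream(int initialSize) {
        super(initialSize);
    }

    // buf and count are protected in ByteArrayOutputStream, so a
    // subclass can manage the buffer's growth itself.
    private void ensureCapacity(int needed) {
        if (needed > buf.length) {
            int step = Math.min(buf.length, MAX_STEP);
            buf = Arrays.copyOf(buf, Math.max(needed, buf.length + step));
        }
    }

    @Override
    public synchronized void write(int b) {
        ensureCapacity(count + 1);
        buf[count++] = (byte) b;
    }

    @Override
    public synchronized void write(byte[] b, int off, int len) {
        ensureCapacity(count + len);
        System.arraycopy(b, off, buf, count, len);
        count += len;
    }
}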

静待花开 2024-12-09 06:50:44

I'm thinking I could just extend ByteArrayOutputStream and rewrite this method so as to return the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?

You shouldn't change the specified behavior of the existing method, but it's perfectly fine to add a new method. Here's an implementation:

/** Subclasses ByteArrayOutputStream to give access to the internal raw buffer. */
public class ByteArrayOutputStream2 extends java.io.ByteArrayOutputStream {
    public ByteArrayOutputStream2() { super(); }
    public ByteArrayOutputStream2(int size) { super(size); }

    /** Returns the internal buffer of this ByteArrayOutputStream, without copying. */
    public synchronized byte[] buf() {
        return this.buf;
    }
}

An alternative but hackish way of getting the buffer from any ByteArrayOutputStream is to use the fact that its writeTo(OutputStream) method passes the buffer directly to the provided OutputStream:

/**
 * Returns the internal raw buffer of a ByteArrayOutputStream, without copying.
 */
public static byte[] getBuffer(ByteArrayOutputStream bout) {
    final byte[][] result = new byte[1][];
    try {
        bout.writeTo(new OutputStream() {
            @Override
            public void write(byte[] buf, int offset, int length) {
                result[0] = buf;
            }

            @Override
            public void write(int b) {}
        });
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    return result[0];
}

(That works, but I'm not sure if it's useful, given that subclassing ByteArrayOutputStream is simpler.)

However, from the rest of your question it sounds like all you want is a plain byte[] with the complete contents of the file. As of Java 7, the simplest and fastest way to do that is to call Files.readAllBytes. In Java 6 and below, you can use DataInputStream.readFully, as in Peter Lawrey's answer. Either way, you will get an array that is allocated once at the correct size, without the repeated reallocation of ByteArrayOutputStream.
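
On Java 7+, that looks like this (the path is illustrative):

import java.nio.file.Files;
import java.nio.file.Paths;

// One allocation at exactly the file size; no stream, no reallocation.
byte[] bytes = Files.readAllBytes(Paths.get("somefile"));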

清眉祭 2024-12-09 06:50:44

If you have 40 MB of data I don't see any reason why it would take more than 40 MB to create a byte[]. I assume you are using a growing ByteArrayOutputStream which creates a byte[] copy when finished.

You can try the old read-the-file-at-once approach.

File file = new File("somefile"); // illustrative path
DataInputStream is = new DataInputStream(new FileInputStream(file));
byte[] bytes = new byte[(int) file.length()];
is.readFully(bytes);
is.close();

Using a MappedByteBuffer is more efficient and avoids a copy of data (or using the heap much) provided you can use the ByteBuffer directly; however, if you have to use a byte[], it's unlikely to help much.
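
To illustrate the trade-off, a hedged sketch (path illustrative): reading through the mapping costs no heap, but materializing a byte[] still pays for one full-size allocation.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

try (RandomAccessFile raf = new RandomAccessFile("somefile", "r");
     FileChannel ch = raf.getChannel()) {
    MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
    byte b = mapped.get(0);                     // no heap copy
    byte[] copy = new byte[mapped.remaining()]; // the full 40 MB on the heap
    mapped.get(copy);                           // one bulk copy
}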

提赋 2024-12-09 06:50:44

... but I found it takes about 160MB of heap space at some point during the copy operation

I find this extremely surprising ... to the extent that I have my doubts that you are measuring the heap usage correctly.

Let's assume that your code is something like this:

BufferedInputStream bis = new BufferedInputStream(
        new FileInputStream("somefile"));
ByteArrayOutputStream baos = new ByteArrayOutputStream();  /* no hint !! */

int b;
while ((b = bis.read()) != -1) {
    baos.write((byte) b);
}
byte[] stuff = baos.toByteArray();

Now the way that a ByteArrayOutputStream manages its buffer is to allocate an initial size and (at least) double the buffer when it fills up. Thus, in the worst case, baos might use up to an 80Mb buffer to hold a 40Mb file.

The final step allocates a new array of exactly baos.size() bytes to hold the buffer's contents. That's 40Mb. So the peak amount of memory that is actually in use should be 120Mb.

So where are those extra 40Mb being used? My guess is that they are not, and that you are actually reporting the total heap size, not the amount of memory that is occupied by reachable objects.


So what is the solution?

  1. You could use a memory mapped buffer.

  2. You could give a size hint when you allocate the ByteArrayOutputStream; e.g.

     ByteArrayOutputStream baos = new ByteArrayOutputStream((int) file.length());
    
  3. You could dispense with the ByteArrayOutputStream entirely and read directly into a byte array.

     byte[] buffer = new byte[(int) file.length()];
     FileInputStream fis = new FileInputStream(file);
     int nosRead = fis.read(buffer);
     /* check that nosRead == buffer.length and repeat if necessary */
    

Both options 1 and 2 should have a peak memory usage of 40Mb while reading a 40Mb file; i.e. no wasted space.


It would be helpful if you posted your code, and described your methodology for measuring memory usage.


I'm thinking I could just extend ByteArrayOutputStream and rewrite this method so as to return the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?

The potential danger is that your assumptions are incorrect, or become incorrect due to someone else modifying your code unwittingly ...

遇到 2024-12-09 06:50:44

Google Guava's ByteSource seems to be a good choice for buffering in memory. Unlike implementations like ByteArrayOutputStream or ByteArrayList (from the Colt library), it does not merge the data into a huge byte array but stores every chunk separately. An example:

List<ByteSource> result = new ArrayList<>();
try (InputStream source = httpRequest.getInputStream()) { // any InputStream will do
    byte[] cbuf = new byte[CHUNK_SIZE]; // CHUNK_SIZE is assumed defined, e.g. 8192
    while (true) {
        int read = source.read(cbuf);
        if (read == -1) {
            break;
        } else {
            result.add(ByteSource.wrap(Arrays.copyOf(cbuf, read)));
        }
    }
}
ByteSource body = ByteSource.concat(result);

The ByteSource can be read as an InputStream anytime later:

InputStream data = body.openBufferedStream();
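
And if a single byte[] is eventually needed after all, ByteSource can produce one, reusing the body variable from the snippet above (this does concatenate the chunks into one array):

byte[] all = body.read(); // one copy, at the final known size
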
入怼 2024-12-09 06:50:44

... came here with the same observation when reading a 1GB file: Oracle's ByteArrayOutputStream has lazy memory management.
A byte array is indexed by an int and is therefore limited to 2GB anyway. If you want to avoid third-party dependencies, you might find this useful:

public static byte[] getBinFileContent(String aFile)
{
    try
    {
        final int bufLen = 32768;
        final long fs = new File(aFile).length();
        final long maxInt = ((long) 1 << 31) - 1;
        if (fs > maxInt)
        {
            System.err.println("file size out of range");
            return null;
        }
        final byte[] res = new byte[(int) fs];
        final byte[] buffer = new byte[bufLen];
        final InputStream is = new FileInputStream(aFile);
        int n;
        int pos = 0;
        while ((n = is.read(buffer)) > 0)
        {
            System.arraycopy(buffer, 0, res, pos, n);
            pos += n;
        }
        is.close();
        return res;
    }
    catch (final IOException e)
    {
        e.printStackTrace();
        return null;
    }
    catch (final OutOfMemoryError e)
    {
        e.printStackTrace();
        return null;
    }
}