Java: memory-efficient ByteArrayOutputStream
I've got a 40MB file on disk and I need to "map" it into memory using a byte array.
At first, I thought writing the file to a ByteArrayOutputStream would be the best way, but I found it takes about 160MB of heap space at some point during the copy operation.
Does somebody know a better way to do this without using three times the file size in RAM?
Update: Thanks for your answers. I noticed I could reduce memory consumption a little by telling ByteArrayOutputStream to use an initial size a bit greater than the original file size (using the exact size with my code forces a reallocation; I have to check why).
There's another high-memory spot: when I get the byte[] back with ByteArrayOutputStream.toByteArray. Taking a look at its source code, I can see it is cloning the array:
public synchronized byte toByteArray()[] {
    return Arrays.copyOf(buf, count);
}
I'm thinking I could just extend ByteArrayOutputStream and override this method so that it returns the original array directly. Is there any potential danger here, given that the stream and the byte array won't be used more than once?
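For illustration only, a minimal sketch of the override being considered (the subclass name is made up; note that buf.length can be larger than count, so the returned array may carry trailing unused bytes unless the capacity exactly matches the data written):

import java.io.ByteArrayOutputStream;

class DirectAccessByteArrayOutputStream extends ByteArrayOutputStream {
    DirectAccessByteArrayOutputStream(int initialCapacity) {
        super(initialCapacity);
    }

    @Override
    public synchronized byte[] toByteArray() {
        return buf;   // return the internal buffer without copying
    }
}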
Comments (9)
MappedByteBuffer might be what you're looking for. I'm surprised it takes so much RAM to read a file into memory, though. Have you constructed the ByteArrayOutputStream with an appropriate capacity? If you haven't, the stream could allocate a new byte array when it's near the end of the 40 MB, meaning that you would, for example, have a full buffer of 39MB and a new buffer of twice the size. Whereas if the stream has the appropriate capacity, there won't be any reallocation (faster) and no wasted memory.
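For example, a mapping sketch might look like this (Java 7+ NIO; the path is a placeholder):

import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

static MappedByteBuffer mapFile(String path) throws IOException {
    try (FileChannel channel = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
        // The mapping remains valid after the channel is closed.
        return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
    }
}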
ByteArrayOutputStream should be okay so long as you specify an appropriate size in the constructor. It will still create a copy when you call toByteArray, but that's only temporary. Do you really mind the memory briefly going up a lot?
Alternatively, if you already know the size to start with, you can just create a byte array and repeatedly read from a FileInputStream into that buffer until you've got all the data.
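A sketch of that second suggestion (assuming the file fits in an int-indexed array; the method name is illustrative):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

static byte[] readIntoPresizedArray(File file) throws IOException {
    byte[] data = new byte[(int) file.length()];          // single allocation up front
    try (FileInputStream in = new FileInputStream(file)) {
        int offset = 0;
        while (offset < data.length) {
            int read = in.read(data, offset, data.length - offset);
            if (read == -1) {
                throw new IOException("File ended early");
            }
            offset += read;
        }
    }
    return data;
}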
If you really want to map the file into memory, then a FileChannel is the appropriate mechanism.
If all you want to do is read the file into a simple byte[] (and don't need changes to that array to be reflected back to the file), then simply reading into an appropriately-sized byte[] from a normal FileInputStream should suffice.
Guava has Files.toByteArray() which does all that for you.
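For example, with Guava on the classpath (the file name is a placeholder):

import com.google.common.io.Files;
import java.io.File;
import java.io.IOException;

static byte[] readWithGuava() throws IOException {
    return Files.toByteArray(new File("data.bin"));   // "data.bin" is a placeholder path
}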
For an explanation of the buffer growth behavior of ByteArrayOutputStream, please read this answer.
In answer to your question, it is safe to extend ByteArrayOutputStream. In your situation, it is probably better to override the write methods such that the maximum additional allocation is limited, say, to 16MB. You should not override toByteArray to expose the protected buf[] member. This is because a stream is not a buffer; a stream is a buffer that has a position pointer and boundary protection. So, it is dangerous to access and potentially manipulate the buffer from outside the class.
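A rough sketch of such an override (the class name and the 16MB step are illustrative, not the answer's original code): instead of letting ByteArrayOutputStream double its buffer, the subclass grows the buffer itself by at most a bounded amount beyond what is strictly required.

import java.io.ByteArrayOutputStream;
import java.util.Arrays;

class BoundedGrowthOutputStream extends ByteArrayOutputStream {
    private static final int MAX_STEP = 16 * 1024 * 1024;   // cap on over-allocation

    BoundedGrowthOutputStream(int initialCapacity) {
        super(initialCapacity);
    }

    @Override
    public synchronized void write(int b) {
        growIfNeeded(count + 1);
        super.write(b);
    }

    @Override
    public synchronized void write(byte[] b, int off, int len) {
        growIfNeeded(count + len);
        super.write(b, off, len);
    }

    private void growIfNeeded(int required) {
        if (required > buf.length) {
            // double as usual, but never over-allocate by more than MAX_STEP
            int doubled = buf.length * 2;
            int capped = Math.min(doubled, required + MAX_STEP);
            buf = Arrays.copyOf(buf, Math.max(required, capped));
        }
    }
}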
You shouldn't change the specified behavior of the existing method, but it's perfectly fine to add a new method. Here's an implementation:
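The original snippet isn't reproduced here; a minimal sketch of the idea, with illustrative names, could look like this:

import java.io.ByteArrayOutputStream;

class ExposedByteArrayOutputStream extends ByteArrayOutputStream {
    ExposedByteArrayOutputStream(int initialCapacity) {
        super(initialCapacity);
    }

    /** The internal buffer, without copying. Only the first size() bytes are valid data. */
    synchronized byte[] internalBuffer() {
        return buf;
    }
}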
An alternative but hackish way of getting the buffer from any ByteArrayOutputStream is to use the fact that its writeTo(OutputStream) method passes the buffer directly to the provided OutputStream:
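That snippet isn't preserved either; a sketch of the trick (the wrapper class name is illustrative; it relies on writeTo() handing over the internal array in a single write(byte[], int, int) call):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

class BufferGrabber extends OutputStream {
    byte[] buffer;   // the stream's internal array, not a copy
    int length;      // number of valid bytes

    @Override
    public void write(int b) {
        throw new UnsupportedOperationException("writeTo() is expected to use the array variant");
    }

    @Override
    public void write(byte[] b, int off, int len) {
        this.buffer = b;
        this.length = len;
    }
}

class GrabberDemo {
    static byte[] grab(ByteArrayOutputStream baos) throws IOException {
        BufferGrabber grabber = new BufferGrabber();
        baos.writeTo(grabber);    // hands the internal buf to grabber.write(byte[], int, int)
        return grabber.buffer;    // note: may be longer than grabber.length
    }
}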
(That works, but I'm not sure if it's useful, given that subclassing ByteArrayOutputStream is simpler.)
However, from the rest of your question it sounds like all you want is a plain byte[] of the complete contents of the file. As of Java 7, the simplest and fastest way to do that is to call Files.readAllBytes. In Java 6 and below, you can use DataInputStream.readFully, as in Peter Lawrey's answer. Either way, you will get an array that is allocated once at the correct size, without the repeated reallocation of ByteArrayOutputStream.
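For instance (Java 7+; the path is a placeholder):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

static byte[] readAll() throws IOException {
    return Files.readAllBytes(Paths.get("data.bin"));   // allocated once at the exact size
}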
If you have 40 MB of data I don't see any reason why it would take more than 40 MB to create a byte[]. I assume you are using a growing ByteArrayOutputStream which creates a byte[] copy when finished.
You can try the old read-the-file-at-once approach, for example:
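The original snippet isn't preserved; a sketch of that approach might be (the file name is a placeholder, and it assumes the file fits in a single int-indexed array):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

static byte[] readFileAtOnce(File file) throws IOException {
    byte[] bytes = new byte[(int) file.length()];        // allocate exactly once
    try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
        in.readFully(bytes);                              // fill the array in one go
    }
    return bytes;
}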
Using a MappedByteBuffer is more efficient and avoids a copy of data (or using the heap much) provided you can use the ByteBuffer directly; however, if you have to use a byte[], it's unlikely to help much.
I find this extremely surprising ... to the extent that I have my doubts that you are measuring the heap usage correctly.
Let's assume that your code is something like this:
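The snippet referred to here isn't preserved; presumably something along these lines (an unsized ByteArrayOutputStream filled from the file and then copied out):

import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;

static byte[] slurp(String fileName) throws IOException {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();    // no size hint
    try (BufferedInputStream in = new BufferedInputStream(new FileInputStream(fileName))) {
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            baos.write(chunk, 0, n);                              // buffer doubles as it fills
        }
    }
    return baos.toByteArray();                                    // final 40Mb copy
}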
Now the way that a ByteArrayOutputStream manages its buffer is to allocate an initial size, and (at least) double the buffer when it fills up. Thus, in the worst case baos might use up to an 80Mb buffer to hold a 40Mb file.
The final step allocates a new array of exactly baos.size() bytes to hold the buffer's contents. That's 40Mb. So the peak amount of memory that is actually in use should be 120Mb.
So where are those extra 40Mb being used? My guess is that they are not, and that you are actually reporting the total heap size, not the amount of memory that is occupied by reachable objects.
So what is the solution?
1. You could use a memory mapped buffer.
2. You could give a size hint when you allocate the ByteArrayOutputStream; e.g. see the sketch after this list.
3. You could dispense with the ByteArrayOutputStream entirely and read directly into a byte array.
Both options 1 and 2 should have a peak memory usage of 40Mb while reading a 40Mb file; i.e. no wasted space.
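For example, option 2 might look like this (the file name is a placeholder; the cast is needed because File.length() returns a long, and the small margin reflects the question's update that the exact size forced a reallocation):

import java.io.ByteArrayOutputStream;
import java.io.File;

File file = new File("data.bin");                                    // placeholder path
ByteArrayOutputStream baos =
        new ByteArrayOutputStream((int) file.length() + 16);         // size hint with a small margin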
It would be helpful if you posted your code, and described your methodology for measuring memory usage.
The potential danger is that your assumptions are incorrect, or become incorrect due to someone else modifying your code unwittingly ...
Google Guava ByteSource seems to be a good choice for buffering in memory. Unlike implementations like ByteArrayOutputStream or ByteArrayList (from Colt Library) it does not merge the data into a huge byte array but stores every chunk separately. An example:
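The original example isn't preserved; a sketch of how such chunked buffering might look with Guava's ByteSource (the chunk size and input source are illustrative):

import com.google.common.io.ByteSource;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

static ByteSource buffer(String fileName) throws IOException {
    List<ByteSource> chunks = new ArrayList<>();
    byte[] chunk = new byte[64 * 1024];
    try (InputStream in = new FileInputStream(fileName)) {
        int n;
        while ((n = in.read(chunk)) != -1) {
            chunks.add(ByteSource.wrap(Arrays.copyOf(chunk, n)));   // each chunk kept separately
        }
    }
    return ByteSource.concat(chunks);   // a view over all chunks, never merged into one array
}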
The ByteSource can be read as an InputStream anytime later:
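And later, for example ('source' being the ByteSource built above):

import java.io.IOException;
import java.io.InputStream;

try (InputStream in = source.openStream()) {
    // consume the buffered data like any other stream
}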
... came here with the same observation when reading a 1GB file: Oracle's ByteArrayOutputStream has lazy memory management.
A byte array is indexed by an int and is therefore limited to 2GB anyway. Without depending on third-party libraries you might find this useful:
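The original snippet isn't preserved; one JDK-only approach in this spirit is a chunked output stream that never copies everything written so far when it grows, assembling the final array exactly once (all names and the chunk size below are illustrative):

import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

class ChunkedByteArrayOutputStream extends OutputStream {
    private static final int CHUNK_SIZE = 1024 * 1024;   // 1MB chunks (arbitrary)
    private final List<byte[]> chunks = new ArrayList<>();
    private byte[] current = new byte[CHUNK_SIZE];
    private int posInChunk = 0;
    private long total = 0;

    @Override
    public void write(int b) {
        if (posInChunk == CHUNK_SIZE) {
            chunks.add(current);                          // keep the full chunk, no copying
            current = new byte[CHUNK_SIZE];
            posInChunk = 0;
        }
        current[posInChunk++] = (byte) b;
        total++;
    }

    /** Assembles the final array exactly once; still limited to 2GB by int indexing. */
    byte[] toByteArray() {
        byte[] result = new byte[(int) total];
        int offset = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, result, offset, CHUNK_SIZE);
            offset += CHUNK_SIZE;
        }
        System.arraycopy(current, 0, result, offset, posInChunk);
        return result;
    }
}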