Collections and memory
I have an application that reads 3-4 GB of data, builds an entity out of each line, and then stores them in Lists.
The problem I had is that memory grows insanely, to something like 13 to 15 GB. Why does storing these entities take so much memory?
So I built a tree and did something similar to Huffman encoding, and the overall memory size became around 200-300 MB.
I understand that I compacted the data. But I wasn't expecting that storing objects in a list would increase the memory so much. Why did that happen?
How about other data structures like dictionary, stack, queue, array, etc.?
Where can I find more information about the internals and memory allocations of data structures?
Or am I doing something wrong?
3 Answers
In .NET, large objects go on the large object heap (LOH), which is not compacted. "Large" means everything above 85,000 bytes. When you grow your lists, their backing arrays will probably become larger than that and have to be reallocated once you cross the current capacity. Reallocation means they are very likely put at the end of the heap. So you end up with a very fragmented LOH and lots of memory usage.
Update: If you initialize your lists with the required capacity (which I guess you can determine from the DB), your memory consumption should go down a bit.
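To see that growth behavior concretely, here is a minimal sketch (the element type and loop bound are placeholders, not from the original answer). Each time Capacity changes, List&lt;T&gt; has allocated a new backing array and copied the old one over; once the array passes 85,000 bytes it lands on the LOH:

```csharp
using System;
using System.Collections.Generic;

class Program
{
    static void Main()
    {
        var list = new List<long>();
        for (int i = 0; i < 100_000; i++)
        {
            int before = list.Capacity;
            list.Add(i);
            if (list.Capacity != before)
            {
                // A new backing array was allocated and the old one copied over.
                // At 8 bytes per element, any array beyond ~10,625 elements
                // exceeds 85,000 bytes and goes on the large object heap.
                Console.WriteLine($"resized: {before} -> {list.Capacity} " +
                                  $"(~{(long)list.Capacity * sizeof(long)} bytes)");
            }
        }
    }
}
```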
Regardless of the data structure you're going to use, your memory consumption is never going to drop below the memory required to store all your data.
Have you calculated how much memory is required to store one instance of your class?
Your Huffman encoding is a space-saving optimization, which means that you are eliminating a lot of duplicated data within your class objects yourself. This has nothing to do with the data structure you use to hold your data. It depends on how your data itself is structured, so that you can take advantage of different space-saving strategies (of which Huffman encoding is one out of many possibilities, suitable for eliminating common prefixes, and the data structure used to store it is a tree).
Now, back to your question. Without optimizing your data (i.e. your objects), there are things you can watch out for to improve memory usage efficiency.
Are all your objects of similar size?
Did you simply run a loop, allocating memory on the fly and then inserting into a list, like this:
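(The code block that originally followed this sentence did not survive; below is a plausible reconstruction of the pattern being described, assuming a hypothetical Entity type and a line-based input file, both illustrative rather than from the original.)

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical entity type standing in for whatever each line parses into.
class Entity
{
    public string Value = "";
    public static Entity Parse(string line) => new Entity { Value = line };
}

class Program
{
    static void Main()
    {
        var entities = new List<Entity>();          // starts with capacity 0
        foreach (string line in File.ReadLines("data.txt"))
        {
            // Each Add that hits the current capacity forces the list to
            // allocate a bigger backing array and copy everything across.
            entities.Add(Entity.Parse(line));
        }
        Console.WriteLine(entities.Count);
    }
}
```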
In that case, your list object is constantly being expanded. If there is not enough free memory at the end to expand the list, .NET will allocate a new, larger piece of memory and copy the original array to the new memory. Essentially you end up with two pieces of memory: the original one, and the new, expanded one (now holding the list). Do this many, many times (as you obviously need to for GBs of data), and you are looking at a LOT of fragmented memory space.
You'll be better off just allocating enough memory for the entire list in one go.
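A minimal sketch of that fix, reusing the hypothetical Entity type from above (the source of expectedCount is assumed, e.g. a row count from the database):

```csharp
// Size the list once so the backing array never has to be reallocated
// while it fills. expectedCount is assumed to be known up front.
int expectedCount = 50_000_000;
var entities = new List<Entity>(expectedCount);
```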
As an afterthought, I can't help but wonder: how in the world are you going to search this HUGE list to find something you need? Shouldn't you be using something like a binary tree or a hash table to aid your searching? Maybe you are just reading in all the data, performing some processing on it, and then writing it all back out...
If you are using classes, read the response to this: Understanding CLR object size between 32 bit vs 64 bit
On 64 bits (you are using 64 bits, right?), object overhead is 16 bytes PLUS the reference to the object (someone is referencing it, right?), so another 8 bytes. So an empty object will "eat" at least 24 bytes.
If you are using List&lt;T&gt;s, remember that a List&lt;T&gt; grows by doubling, so you could be wasting much space. Other .NET collections grow in the same way. I'll add that the "pure" overhead of millions of List&lt;T&gt;s could bring the memory to its knees. Other than the 16 + 8 bytes of space "eaten" by the List&lt;T&gt; object itself, it is composed (in the .NET implementation) of 2 ints (8 bytes), a SyncRoot reference (8 bytes, normally null), and a reference to the internal array (so another 8 bytes for the reference, plus 16 bytes of header for the array itself, plus the array contents).
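To put rough numbers on that per-object overhead, here is a small measurement sketch (the Entity type is hypothetical, and exact figures vary by runtime, platform, and GC state):

```csharp
using System;

class Entity
{
    public int Id;    // 4 bytes of payload
}

class Program
{
    static void Main()
    {
        const int N = 1_000_000;
        var keep = new object[N];    // allocated up front so only the
                                     // Entity instances affect the delta

        long before = GC.GetTotalMemory(forceFullCollection: true);
        for (int i = 0; i < N; i++)
            keep[i] = new Entity { Id = i };
        long after = GC.GetTotalMemory(forceFullCollection: true);

        // On 64-bit, expect roughly 24 bytes per instance even though the
        // payload is only 4 bytes: 16 bytes of header/method-table overhead
        // plus the field padded up to the 8-byte allocation granularity.
        Console.WriteLine($"~{(after - before) / (double)N:F1} bytes per object");
        GC.KeepAlive(keep);
    }
}
```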