How to process billions of objects without getting an "Out of memory" error
I have an application which may need to process billions of objects. Each object is of the TRange class type. These ranges are created at different parts of an algorithm, depending on certain conditions and on other object properties. As a result, if you have 100 items, you can't directly create the 100th object without creating all the prior objects. If I create all the (billions of) objects and add them to a collection, the system will throw an out-of-memory error. Now I want to iterate through each object, mainly for two purposes:
- To apply an operation to each TRange object (e.g. output certain properties)
- To get a cumulative sum of a certain property (e.g. each range has a weight property and I want to retrieve the total weight, that is, the sum of all the range weights).
How do I effectively create an iterator for these objects without running out of memory?
I have handled the first case by passing a function pointer to the algorithm function. For example:
    // aProc is a pointer to a procedure that takes a TRange
    procedure createRanges(aProc: TRangeProc);
    var
      range: TRange;
      rangerec: TRangeRec;
    begin
      range := TRange.Create;
      try
        while canCreateRange do  // certain conditions needed to create a range
        begin
          rangerec := ReturnRangeRec;
          range.Update(rangerec);  // don't create a new object, reuse the same one
          if Assigned(aProc) then
            aProc(range);
        end;
      finally
        range.Free;
      end;
    end;
But the problem with this approach is that to add new functionality, say retrieving the total weight I mentioned earlier, I have to either duplicate the algorithm function or pass an optional out parameter. Please suggest some ideas.
Thank you all in advance
Pradeep
Comments (6)
For such large amounts of data you should keep only a portion of the data in memory. The rest should be serialized to the hard drive. I tackled such a problem like this:
This way the records are transparent. You always access them as if they were in memory, but they may have to be loaded from the hard drive first. It works really well. By the way, virtual memory works in a very similar way: RAM only holds a certain subset of all your data, and the rest is paged out to the hard drive. That subset is your working set.
I did not post any code because it is beyond the scope of the question itself and would only confuse.
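For readers who want a starting point, the idea of keeping records on disk and loading them on demand can be sketched with a typed file of fixed-size records. This is only an illustration and not the code the answer refers to: the `TRangeRec` fields, the `ranges.dat` file name and the `WriteRanges` / `ReadRange` helpers are all assumptions.

```pascal
program RangeFile;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Hypothetical fixed-size layout for one range.
  TRangeRec = packed record
    Start, Stop: Integer;
    Weight: Double;
  end;

const
  FileName = 'ranges.dat';  // hypothetical scratch file

// Writes Count generated records to disk; only one is in memory at a time.
procedure WriteRanges(Count: Integer);
var
  F: file of TRangeRec;
  Rec: TRangeRec;
  I: Integer;
begin
  AssignFile(F, FileName);
  Rewrite(F);
  try
    for I := 0 to Count - 1 do
    begin
      Rec.Start := I * 10;
      Rec.Stop := I * 10 + 9;
      Rec.Weight := I;
      Write(F, Rec);
    end;
  finally
    CloseFile(F);
  end;
end;

// Loads the record at Index straight from disk.
function ReadRange(Index: Integer): TRangeRec;
var
  F: file of TRangeRec;
begin
  AssignFile(F, FileName);
  Reset(F);
  try
    Seek(F, Index);  // typed files seek in units of whole records
    Read(F, Result);
  finally
    CloseFile(F);
  end;
end;

begin
  WriteRanges(1000);
  WriteLn(ReadRange(123).Weight:0:1);
end.
```

Because every record has the same size, any record can be reached with a single `Seek`, so memory use stays constant no matter how many ranges are written.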
Look at TgsStream64. This class can handle huge amounts of data through file mapping.
http://code.google.com/p/gedemin/source/browse/trunk/Gedemin/Common/gsMMFStream.pas
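It's usually done like this: you write an enumerator function (like you did) which receives a callback function pointer (you did that too) and an untyped pointer ("Data: Pointer"). You define the callback function so that its first parameter is that same untyped pointer, and the enumerator passes it through unchanged on every call.
Then, if you want to, say, sum all the range weights, you pass the address of an accumulator variable as Data and let the callback add each range's weight to it.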
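A minimal sketch of that pattern, reusing the names from the question where possible. The bodies of `TRange` and `TRangeRec`, the fixed loop of four ranges, and the helper names `PrintWeight`, `AddWeight` and `TotalWeight` are assumptions made so the example is self-contained.

```pascal
program EnumerateRanges;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Hypothetical stand-ins for the question's types.
  TRangeRec = record
    Weight: Double;
  end;

  TRange = class
  public
    Weight: Double;
    procedure Update(const ARec: TRangeRec);
  end;

  // The callback receives the caller's untyped Data pointer first.
  TRangeProc = procedure(Data: Pointer; ARange: TRange);

procedure TRange.Update(const ARec: TRangeRec);
begin
  Weight := ARec.Weight;
end;

// Enumerator: produces ranges one at a time, reusing a single object,
// and hands each one to the callback together with Data.
procedure CreateRanges(AProc: TRangeProc; Data: Pointer);
var
  Range: TRange;
  Rec: TRangeRec;
  I: Integer;
begin
  Range := TRange.Create;
  try
    for I := 1 to 4 do  // stands in for "while canCreateRange do"
    begin
      Rec.Weight := I;  // stands in for ReturnRangeRec
      Range.Update(Rec);
      if Assigned(AProc) then
        AProc(Data, Range);
    end;
  finally
    Range.Free;
  end;
end;

// First use case: apply an operation to each range. Data is unused here.
procedure PrintWeight(Data: Pointer; ARange: TRange);
begin
  WriteLn('weight: ', ARange.Weight:0:1);
end;

// Second use case: Data points at a Double that accumulates the sum.
procedure AddWeight(Data: Pointer; ARange: TRange);
begin
  PDouble(Data)^ := PDouble(Data)^ + ARange.Weight;
end;

function TotalWeight: Double;
begin
  Result := 0;
  CreateRanges(AddWeight, @Result);
end;

begin
  CreateRanges(PrintWeight, nil);
  WriteLn('total: ', TotalWeight:0:1);
end.
```

The algorithm function is written once; each new feature is just another callback, and whatever state it needs travels through `Data` instead of through extra out parameters.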
Anyway, if you need to create billions of ANYTHING, you're probably doing it wrong (unless you're a scientist modelling something extremely large-scale and detailed). Even more so if you need to create billions of things every time you want one of them. This is never good. Try to think of alternative solutions.
"Runner" has a good answer how to handle this!
But I would like to know if you could do a quick fix: make the TRange objects smaller.
Maybe you have a big ancestor? Can you take a look at the instance size of the TRange object?
Maybe you would be better off using packed records?
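One way to check this is to compare `InstanceSize` of the class with `SizeOf` of an equivalent packed record. The field layout below is an assumption, since the real TRange declaration is not shown in the question, and `TRangeObj` is a hypothetical stand-in for it.

```pascal
program RangeSize;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

type
  // Hypothetical fields; substitute the real TRange class here.
  TRangeObj = class
  public
    Start, Stop: Integer;
    Weight: Double;
  end;

  // The same data as a packed record: no hidden fields, no padding.
  TRangeRec = packed record
    Start, Stop: Integer;
    Weight: Double;
  end;

begin
  // InstanceSize counts hidden per-object data (at least the VMT pointer)
  // and every field inherited from ancestors.
  WriteLn('TRangeObj.InstanceSize = ', TRangeObj.InstanceSize);
  WriteLn('SizeOf(TRangeRec)      = ', SizeOf(TRangeRec));
end.
```

Each object also carries memory manager overhead that `InstanceSize` does not report, so an array of records is usually more compact than the same number of separately allocated objects.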
The part about not being able to create the 100th object without creating all the prior objects sounds a bit like calculating Fibonacci numbers. Maybe you can reuse some of the TRange objects instead of creating redundant copies? Here is a C++ article describing this approach: it works by storing already calculated intermediate results in a hash map.
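The linked article is in C++; the same technique in Delphi might look like the sketch below, which caches Fibonacci numbers in a `TDictionary`. It assumes a compiler with generics support (Delphi 2009 or later, or Free Pascal 3.2 in Delphi mode; newer Delphi versions may need the unit spelled `System.Generics.Collections`). Whether TRange results can be keyed and cached this way depends on the real algorithm, which the question does not show.

```pascal
program MemoFib;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  Generics.Collections;

var
  // Maps N to the already computed N-th Fibonacci number.
  Cache: TDictionary<Integer, Int64>;

// Returns the N-th Fibonacci number, computing each value at most once.
function Fib(N: Integer): Int64;
begin
  if N < 2 then
  begin
    Result := N;
    Exit;
  end;
  if Cache.TryGetValue(N, Result) then
    Exit;  // intermediate result was stored earlier
  Result := Fib(N - 1) + Fib(N - 2);
  Cache.Add(N, Result);
end;

begin
  Cache := TDictionary<Integer, Int64>.Create;
  try
    WriteLn(Fib(80));
  finally
    Cache.Free;
  end;
end.
```

Without the cache the recursion recomputes the same values an exponential number of times; with it, each value is computed once and then looked up.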
Handling billions of objects is possible but you should avoid it as much as possible. Do this only if you absolutely have to...
I once created a system that needed to be able to handle a huge amount of data. To do so, I made my objects "streamable" so I could read them from and write them to disk. A larger class around them decided when an object would be saved to disk and thus removed from memory. Basically, whenever I requested an object, this class would check whether it was loaded. If not, it would re-create the object from disk, put it on top of a stack, and then write the bottom object of that stack to disk. As a result, my stack had a fixed (maximum) size. This allowed me to use an unlimited number of objects, with reasonably good performance too.
Unfortunately, I don't have that code available anymore. I wrote it for a previous employer about 7 years ago. I do know that you would need to write a bit of code for the streaming support, plus a bunch more for the stack controller which maintains all those objects. But it technically would allow you to create an unlimited number of objects, since you're trading RAM for disk space.
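Since the original code is not available, the following is only a rough sketch of the idea under several assumptions: records are fixed-size (`TRangeRec` with a single hypothetical `Weight` field), the backing store is a scratch file accessed through `TFileStream`, and the "stack" is replaced by an array of slots in which the least recently used one is written out when room is needed. `TRangeCache`, `TSlot` and `Get` are hypothetical names.

```pascal
program RangeCacheDemo;
{$IFDEF FPC}{$mode delphi}{$ELSE}{$APPTYPE CONSOLE}{$ENDIF}

uses
  Classes;

type
  TRangeRec = packed record  // hypothetical layout
    Weight: Double;
  end;
  PRangeRec = ^TRangeRec;

  TSlot = record
    Index: Int64;    // record number held here, -1 when empty
    LastUse: Int64;  // access counter at the most recent use
    Rec: TRangeRec;
  end;

  // Keeps at most ACapacity records in memory; the rest live in a scratch file.
  TRangeCache = class
  private
    FStream: TFileStream;
    FSlots: array of TSlot;
    FTick: Int64;
  public
    constructor Create(const AFileName: string; ACapacity: Integer);
    destructor Destroy; override;
    // The returned pointer is only valid until the next call to Get.
    function Get(Index: Int64): PRangeRec;
  end;

constructor TRangeCache.Create(const AFileName: string; ACapacity: Integer);
var
  I: Integer;
begin
  inherited Create;
  FStream := TFileStream.Create(AFileName, fmCreate);
  SetLength(FSlots, ACapacity);  // ACapacity must be at least 1
  for I := 0 to High(FSlots) do
    FSlots[I].Index := -1;
end;

destructor TRangeCache.Destroy;
begin
  FStream.Free;  // the file is scratch space, so nothing is flushed here
  inherited;
end;

function TRangeCache.Get(Index: Int64): PRangeRec;
var
  I, S: Integer;
begin
  S := 0;
  for I := 0 to High(FSlots) do
    if FSlots[I].Index = Index then
    begin
      S := I;  // already in memory
      Break;
    end
    else if FSlots[I].LastUse < FSlots[S].LastUse then
      S := I;  // least recently used slot seen so far
  if FSlots[S].Index <> Index then
  begin
    if FSlots[S].Index >= 0 then
    begin  // evict: write the least recently used record to disk
      FStream.Position := FSlots[S].Index * SizeOf(TRangeRec);
      FStream.WriteBuffer(FSlots[S].Rec, SizeOf(TRangeRec));
    end;
    FillChar(FSlots[S].Rec, SizeOf(TRangeRec), 0);
    FStream.Position := Index * SizeOf(TRangeRec);
    FStream.Read(FSlots[S].Rec, SizeOf(TRangeRec));  // reads nothing past EOF
    FSlots[S].Index := Index;
  end;
  Inc(FTick);
  FSlots[S].LastUse := FTick;
  Result := @FSlots[S].Rec;
end;

var
  Cache: TRangeCache;
  I: Integer;
  Total: Double;
begin
  Total := 0;
  Cache := TRangeCache.Create('ranges.cache', 8);  // only 8 records in RAM
  try
    for I := 0 to 999 do
      Cache.Get(I)^.Weight := I;
    for I := 0 to 999 do
      Total := Total + Cache.Get(I)^.Weight;
    WriteLn(Total:0:1);
  finally
    Cache.Free;
  end;
end.
```

The linear scan over the slots keeps the sketch short; a real controller would more likely combine a hash map with a linked list so that lookups and evictions stay cheap as the capacity grows.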