Random-access container that does not fit in memory?
I have an array of objects (say, images), which is too large to fit into memory (e.g. 40GB). But my code needs to be able to randomly access these objects at runtime.
What is the best way to do this?
From my code's point of view, it shouldn't matter, of course, if some of the data is on disk or temporarily stored in memory; it should have transparent access:
container.getObject(1242)->process();
container.getObject(479431)->process();
But how should I implement this container? Should it just send the requests to a database? If so, which one would be the best option? (If a database, then it should be free and not too much administration hassle, maybe Berkeley DB or sqlite?)
Should I just implement it myself, memoizing objects after access and purging the memory when it's full? Or are there good libraries (C++) for this out there?
The requirements for the container would be that it minimizes disk access (some elements might be accessed more frequently by my code, so they should be kept in memory) and allows fast access.
UPDATE: It turns out that STXXL does not work for my problem because the objects I store in the container have dynamic size, i.e. my code may update them (increasing or decreasing the size of some objects) at runtime. But STXXL cannot handle that:
STXXL containers assume that the data types they store are plain old data types (POD).
http://algo2.iti.kit.edu/dementiev/stxxl/report/node8.html
Could you please comment on other solutions? What about using a database? And which one?
Comments (5)
Consider using the STXXL:
You could look into memory-mapped files and access the data through one of those.
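A minimal sketch of the memory-mapped approach, assuming a 64-bit POSIX system (a 40GB mapping needs the address space) and a read-only data file. The class name `MappedFile` and its `at()` accessor are made up for illustration. Note that mapping alone does not handle objects that change size; it only lets the operating system decide which parts of the file are resident.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: read-only view of a whole file mapped into memory.
// The OS pages data in on first touch and may drop it under memory pressure.
class MappedFile {
public:
    explicit MappedFile(const std::string& path) {
        int fd = ::open(path.c_str(), O_RDONLY);
        if (fd < 0) throw std::runtime_error("open failed: " + path);
        struct stat st{};
        if (::fstat(fd, &st) != 0 || st.st_size == 0) {
            ::close(fd);
            throw std::runtime_error("fstat failed or empty file: " + path);
        }
        size_ = static_cast<std::size_t>(st.st_size);
        void* p = ::mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
        ::close(fd);  // the mapping stays valid after the descriptor is closed
        if (p == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);
        data_ = static_cast<const std::uint8_t*>(p);
    }
    ~MappedFile() { ::munmap(const_cast<std::uint8_t*>(data_), size_); }
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    std::size_t size() const { return size_; }

    // Pointer to the byte at `offset`; no explicit read call is needed.
    const std::uint8_t* at(std::size_t offset) const {
        assert(offset < size_);
        return data_ + offset;
    }

private:
    const std::uint8_t* data_ = nullptr;
    std::size_t size_ = 0;
};
```

To locate object number n inside the mapping you would still need your own table of offsets, which is not shown here.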
I would implement a basic cache. With this working-set size you will get the best results from a set-associative cache with x-byte cache lines (x being whatever best matches your access pattern). Just implement in software what every modern processor already has in hardware. This should, IMHO, give you the best results. You could then optimize it further if you can make the access pattern roughly linear.
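One way this could look in software, as a rough sketch rather than a tuned implementation: each key maps to one set (here simply `key % sets`), and each set keeps up to `ways` entries with least-recently-used eviction inside the set. The class name, the `Loader` callback and the miss counter are assumptions made for the example, and `Value` must be default-constructible.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical software set-associative cache with per-set LRU eviction.
template <typename Value>
class SetAssociativeCache {
public:
    using Loader = std::function<Value(std::uint64_t)>;

    SetAssociativeCache(std::size_t sets, std::size_t ways, Loader loader)
        : sets_(sets), ways_(ways), loader_(std::move(loader)), slots_(sets * ways) {
        assert(sets > 0 && ways > 0);
    }

    // Returns the cached value, calling the loader on a miss.
    // The reference is only valid until the slot is evicted by a later get().
    Value& get(std::uint64_t key) {
        Slot* base = &slots_[(key % sets_) * ways_];
        Slot* victim = base;
        for (std::size_t i = 0; i < ways_; ++i) {
            Slot& s = base[i];
            if (s.valid && s.key == key) {  // hit
                s.lastUsed = ++clock_;
                return s.value;
            }
            // Prefer an empty slot; otherwise remember the oldest one.
            if (!s.valid) {
                if (victim->valid) victim = &s;
            } else if (victim->valid && s.lastUsed < victim->lastUsed) {
                victim = &s;
            }
        }
        ++misses_;
        victim->value = loader_(key);  // miss: fetch from the backing store
        victim->key = key;
        victim->valid = true;
        victim->lastUsed = ++clock_;
        return victim->value;
    }

    std::size_t misses() const { return misses_; }

private:
    struct Slot {
        std::uint64_t key = 0;
        std::uint64_t lastUsed = 0;
        bool valid = false;
        Value value{};
    };
    std::size_t sets_, ways_;
    Loader loader_;
    std::vector<Slot> slots_;
    std::uint64_t clock_ = 0;
    std::size_t misses_ = 0;
};
```

Because a lookup only scans the `ways` slots of one set, the cost per access stays constant no matter how many entries the cache holds; the trade-off is that keys colliding on the same set can evict each other while other sets sit idle.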
One solution is to use a structure similar to a B-Tree, indices and "pages" of arrays or vectors. The concept is that the index is used to determine which page to load into memory to access your variable.
If you make the page size smaller, you can keep multiple pages in memory. A caching system based on frequency of use, or some other rule, will reduce the number of page loads.
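The paging idea can be sketched as follows, under simplifying assumptions: pages hold a fixed number of elements, the page number is just `index / pageSize` instead of a real B-tree lookup, and eviction is least-recently-used. `PagedArray` and `PageLoader` are illustrative names; the loader stands in for whatever reads a page from disk.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical paged container: at most `maxPages` pages are resident,
// and the least recently used page is dropped when a new one is needed.
template <typename T>
class PagedArray {
public:
    using Page = std::vector<T>;
    using PageLoader = std::function<Page(std::size_t pageNo)>;

    PagedArray(std::size_t pageSize, std::size_t maxPages, PageLoader loader)
        : pageSize_(pageSize), maxPages_(maxPages), loader_(std::move(loader)) {
        assert(pageSize_ > 0 && maxPages_ > 0);
    }

    // Read-only access; the reference may dangle once the page is evicted.
    const T& at(std::size_t index) {
        const std::size_t pageNo = index / pageSize_;
        auto it = resident_.find(pageNo);
        if (it != resident_.end()) {
            // Hit: move the page to the front of the LRU list.
            lru_.splice(lru_.begin(), lru_, it->second);
        } else {
            if (lru_.size() == maxPages_) {  // evict the coldest page
                resident_.erase(lru_.back().first);
                lru_.pop_back();
            }
            lru_.emplace_front(pageNo, loader_(pageNo));
            resident_[pageNo] = lru_.begin();
            ++loads_;
        }
        return lru_.front().second.at(index % pageSize_);
    }

    std::size_t pageLoads() const { return loads_; }

private:
    using Entry = std::pair<std::size_t, Page>;
    std::size_t pageSize_, maxPages_;
    PageLoader loader_;
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<std::size_t, typename std::list<Entry>::iterator> resident_;
    std::size_t loads_ = 0;
};
```

Writing modified pages back before eviction, and handling objects whose size changes, would need extra bookkeeping that this sketch leaves out.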
I've seen some very clever code that overloads operator[]() to perform disk access on the fly and load required data from disk/database transparently.
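A small sketch of what such an overload might look like; this is not the code the answer refers to. It assumes the objects are stored back to back in one binary file and that an in-memory table of (offset, length) pairs says where each one lives. `DiskArray` and `Extent` are hypothetical names, and there is no caching, so every access reads from disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Where each variable-sized object lives inside the data file (assumed layout).
struct Extent {
    std::uint64_t offset;
    std::uint64_t length;
};

class DiskArray {
public:
    DiskArray(const std::string& path, std::vector<Extent> index)
        : file_(path, std::ios::binary), index_(std::move(index)) {
        if (!file_) throw std::runtime_error("cannot open " + path);
    }

    std::size_t size() const { return index_.size(); }

    // Looks like array access, but each call reads the object's bytes from disk.
    std::string operator[](std::size_t i) {
        assert(i < index_.size());
        const Extent& e = index_[i];
        std::string bytes(static_cast<std::size_t>(e.length), '\0');
        file_.clear();  // reset any earlier EOF/fail state
        file_.seekg(static_cast<std::streamoff>(e.offset));
        file_.read(&bytes[0], static_cast<std::streamsize>(bytes.size()));
        if (!file_) throw std::runtime_error("short read");
        return bytes;
    }

private:
    std::ifstream file_;
    std::vector<Extent> index_;
};
```

With this in place, `store[1242]` reads like ordinary indexing at the call site. Returning the bytes by value keeps the sketch simple; a real container would more likely return a cached object or a proxy so that repeated accesses do not hit the disk again.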