在 python 中维护一个大列表
我需要维护一个大的 python pickleable 对象列表。该列表太大,无法全部存储在 RAM 中,因此需要一些数据库\分页机制。我需要该机制支持快速访问列表中的近距离(附近)区域。
该列表应该实现所有 python-list 功能,但大多数时候我将按顺序工作:扫描列表中的某个范围,并在扫描时决定是否要在扫描点中插入\弹出一些节点。
该列表可能非常大(2-3 GB),并且不应一次全部包含在 RAM 中。 节点很小(100-200 字节),但可以包含各种类型的数据。
一个好的解决方案是使用 BTree,其中只有最后访问的存储桶才会加载到 RAM 中。
使用 SQL 表并不好,因为我需要实现复杂的索引键机制。 我的数据不是表格,它是一个简单的Python列表,具有在特定索引中添加元素以及从特定位置弹出元素的功能。
我尝试了 ZODB 和 zc.blist,它实现了一个基于BTree的列表,可以存储在ZODB数据库文件中,但我不知道如何配置它,所以上面的功能将在合理的时间内运行。 我不需要所有的多线程\事务功能。除了我的单线程程序之外,没有其他人会接触数据库文件。
谁能解释一下如何配置 ZODB\zc.blist 以便上述功能快速运行,或者向我展示不同的大列表实现?
我尝试过一些快速而肮脏的代码:
import time
import random
NODE_JUMP = 50000
NODE_ACCESS = 10000
print 'STARTING'
random_bytes = open('/dev/urandom', 'rb')
my_list = list()
nodes_no = 0
while True:
nodes_no += NODE_JUMP
start = time.time()
my_list.extend(random_bytes.read(100) for i in xrange(NODE_JUMP))
print 'extending to %s nodes took %.2f seconds' % (nodes_no, time.time() - start)
section_start = random.randint(0, nodes_no -NODE_ACCESS -1)
start = time.time()
for index in xrange(section_start, section_start + NODE_ACCESS):
# rotate the string
my_list[index] = my_list[index][1:] + my_list[index][0]
print 'access to %s nodes took %.2f seconds' % (NODE_ACCESS, time.time() - start,)
打印结束于:
extending to 5000000 nodes took 3.49 seconds access to 10000 nodes took 0.02 seconds extending to 5050000 nodes took 3.98 seconds access to 10000 nodes took 0.01 seconds extending to 5100000 nodes took 2.54 seconds access to 10000 nodes took 0.01 seconds extending to 5150000 nodes took 2.19 seconds access to 10000 nodes took 0.11 seconds extending to 5200000 nodes took 2.49 seconds access to 10000 nodes took 0.01 seconds extending to 5250000 nodes took 3.13 seconds access to 10000 nodes took 0.05 seconds Killed (not by me)
I need to maintain a large list of python pickleable objects. The list is too large to be all stored in the RAM, so some database\paging mechanism is required. I need that the mechanism will support fast access for close (nearby) areas in the list.
The list should implement all the python-list features, but most of the time I will work sequentially: scan some range in the list and while scanning decide if I want to insert\pop some nodes in the scanning point.
The list can be very large (2-3 GB), and should not be all contained in the RAM at once.
The nodes are small (100-200 bytes) but can contain various types of data.
A good solution for this could be using a BTree, where only the last accessed buckets are loaded in the RAM.
Using SQL tables is not good, since I'll need to implement a complex index-key mechanism.
My data is not a table, its a simple python list, with the feature of adding elements in specific indexes, and popping elements from specific positions.
I tried ZODB and zc.blist, which implement a BTree based list that can be stored in a ZODB database file, but I don't know how to configure it so the above features will run in reasonable time.
I don't need all the multi-threading\transactioning features. No one else will touch the database-file except for my single-thread program.
Can anyone explain me how to configure the ZODB\zc.blist so the above features will run fast, or show me a different large-list implementation?
Some quick&dirty code that I tried:
import time
import random
NODE_JUMP = 50000
NODE_ACCESS = 10000
print 'STARTING'
random_bytes = open('/dev/urandom', 'rb')
my_list = list()
nodes_no = 0
while True:
nodes_no += NODE_JUMP
start = time.time()
my_list.extend(random_bytes.read(100) for i in xrange(NODE_JUMP))
print 'extending to %s nodes took %.2f seconds' % (nodes_no, time.time() - start)
section_start = random.randint(0, nodes_no -NODE_ACCESS -1)
start = time.time()
for index in xrange(section_start, section_start + NODE_ACCESS):
# rotate the string
my_list[index] = my_list[index][1:] + my_list[index][0]
print 'access to %s nodes took %.2f seconds' % (NODE_ACCESS, time.time() - start,)
Print ended with:
extending to 5000000 nodes took 3.49 seconds access to 10000 nodes took 0.02 seconds extending to 5050000 nodes took 3.98 seconds access to 10000 nodes took 0.01 seconds extending to 5100000 nodes took 2.54 seconds access to 10000 nodes took 0.01 seconds extending to 5150000 nodes took 2.19 seconds access to 10000 nodes took 0.11 seconds extending to 5200000 nodes took 2.49 seconds access to 10000 nodes took 0.01 seconds extending to 5250000 nodes took 3.13 seconds access to 10000 nodes took 0.05 seconds Killed (not by me)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
毕竟使用zc.blist可以带来好的结果,并且在创建DB时设置“cache_size”选项控制将保留在RAM中的数据的大小。如果您不经常执行“transaction.commit”,则已用 RAM 的大小可能会变得更大。通过定义一个大的cache_size并经常执行transaction.commit,blist的最后访问的存储桶将保留在RAM中,让您可以快速访问它们,并且使用的RAM量不会增长太多。
虽然包装非常昂贵,但如果你有一个大硬盘,你就不必经常这样做。
这里有一些代码供您自己尝试。在后台运行“top”并更改cache_size以查看它如何影响已使用的RAM量。
Using zc.blist can bring good results after all, and setting the "cache_size" option when creating the DB controls the size of the data that will remain in the RAM. The size of used RAM can grow bigger if you don't do "transaction.commit" often enough. By defining a large cache_size and doing transaction.commit often, the last accessed buckets of the blist will stay in the RAM, giving you fast access to them, and the amount of used RAM won't grow too much.
Packing is very expensive though, but if you have a large harddisk, you don't have to do it that often anyway.
Here is some code to try yourself. Run "top" at the background and change cache_size to see how it affects the amount of used RAM.
使用东京橱柜怎么样?非常快速和简单,就像列表一样,但专为您想要的而构建。
http://1978th.net/tokyocabinet/
How about using Tokyo Cabinet? Very fast and simple, like lists but built for what you want.
http://1978th.net/tokyocabinet/
我认为 ZODB 是一个可以使用的工具。它将存储大量任意项目,它处理内存问题。
这是一个工作示例,在本例中,我包含了相互引用并按编号存储在 BTree 中的对象。
此时,
tree
变量基本上就像字典一样工作,可以通过键访问。您还可以使用tree.keys(min, max)
获取某个范围内的键,如 ZODB BTrees API 文档。您可以通过将每个列表放在 ZODB 返回的
root
对象中的不同键下来存储 10 个列表。root
对象充当 ZODB 对象存储的“网关”。感谢 ZODB,您还可以使用对象间引用以及 Btree 索引。例如:
I think the ZODB is the tool to use. It will store lots of arbitrary items, it handles memory issues.
Here is a working example, and in this case I included objects that reference each other as well as being stored in a BTree by number.
At this point the
tree
variable basically works like a dictionary and can be accessed by the key. You can also get the keys in a range by usingtree.keys(min, max)
as described by the ZODB BTrees API documentation.You can store your 10 lists by putting each one under a different key in the
root
object returned by the ZODB. Theroot
object serves as the "gateway" to the ZODB object storage.Thanks to the ZODB, you can also use the inter-object references as well as the Btree index. For example: