Maintaining a large list in Python

Published 2024-08-27 04:00:21

I need to maintain a large list of Python pickleable objects. The list is too large to be stored entirely in RAM, so some database/paging mechanism is required. The mechanism needs to support fast access to close (nearby) regions of the list.

The list should implement all the Python-list features, but most of the time I will work sequentially: scan some range of the list and, while scanning, decide whether I want to insert or pop some nodes at the scanning point.

The list can be very large (2-3 GB) and should not be held entirely in RAM at once.
The nodes are small (100-200 bytes) but can contain various types of data.

A good solution for this could be a BTree, where only the most recently accessed buckets are loaded into RAM.

Using SQL tables is not a good fit, since I would need to implement a complex index-key mechanism.
My data is not a table; it's a simple Python list, with the ability to add elements at specific indexes and pop elements from specific positions.
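
To make the access pattern concrete, here is a sketch of the intended usage against a plain in-memory list (for illustration only; needs_split, obsolete, and make_node are hypothetical placeholders, and the real structure must be disk-backed):

# Sketch of the sequential scan-and-edit pattern. The predicates and
# constructor below are hypothetical placeholders.
cursor = 1000
while cursor < 2000:
    node = my_list[cursor]
    if needs_split(node):
        my_list.insert(cursor + 1, make_node())  # insert at the scan point
        cursor += 2
    elif obsolete(node):
        my_list.pop(cursor)                      # pop at the scan point
    else:
        cursor += 1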

I tried ZODB and zc.blist, which implement a BTree-based list that can be stored in a ZODB database file, but I don't know how to configure it so that the operations above run in reasonable time.
I don't need all the multi-threading/transaction features; nothing will touch the database file except my single-threaded program.

Can anyone explain how to configure ZODB/zc.blist so that these operations run fast, or point me to a different large-list implementation?

Some quick-and-dirty code that I tried:

import time
import random

NODE_JUMP = 50000    # nodes appended per iteration
NODE_ACCESS = 10000  # nodes rewritten per iteration

print 'STARTING'

random_bytes = open('/dev/urandom', 'rb')

my_list = list()

nodes_no = 0

while True:
    # Grow the list by NODE_JUMP random 100-byte strings.
    nodes_no += NODE_JUMP
    start = time.time()
    my_list.extend(random_bytes.read(100) for i in xrange(NODE_JUMP))
    print 'extending to %s nodes took %.2f seconds' % (nodes_no, time.time() - start)

    # Rewrite a random contiguous section of NODE_ACCESS nodes.
    section_start = random.randint(0, nodes_no - NODE_ACCESS - 1)
    start = time.time()
    for index in xrange(section_start, section_start + NODE_ACCESS):
        # rotate the string by one byte
        my_list[index] = my_list[index][1:] + my_list[index][0]

    print 'access to %s nodes took %.2f seconds' % (NODE_ACCESS, time.time() - start)

The output ended with:

extending to 5000000 nodes took 3.49 seconds
access to 10000 nodes took 0.02 seconds
extending to 5050000 nodes took 3.98 seconds
access to 10000 nodes took 0.01 seconds
extending to 5100000 nodes took 2.54 seconds
access to 10000 nodes took 0.01 seconds
extending to 5150000 nodes took 2.19 seconds
access to 10000 nodes took 0.11 seconds
extending to 5200000 nodes took 2.49 seconds
access to 10000 nodes took 0.01 seconds
extending to 5250000 nodes took 3.13 seconds
access to 10000 nodes took 0.05 seconds
Killed (not by me)

3 Answers

油饼 2024-09-03 04:00:21

Using zc.blist can bring good results after all. Setting the "cache_size" option when creating the DB controls the amount of data that will remain in RAM; the RAM actually used can grow beyond that if you don't call "transaction.commit" often enough. By defining a large cache_size and committing often, the most recently accessed buckets of the blist stay in RAM, giving you fast access to them, while the amount of RAM used doesn't grow too much.

Packing is very expensive, though; but if you have a large hard disk, you don't have to do it that often anyway.

Here is some code to try yourself. Run "top" in the background and change cache_size to see how it affects the amount of RAM used.

import time
import os
import glob
from ZODB import DB
from ZODB.FileStorage import FileStorage
import transaction
from zc.blist import BList

print('STARTING')

# This file handle deliberately shadows the stdlib `random` module;
# it is used only as a source of random bytes.
random = open('/dev/urandom', 'rb')


def test_list(my_list, loops=1000, element_size=100):
    print('testing list')
    start = time.time()
    for loop in xrange(loops):
        my_list.append(random.read(element_size))
    print('appending %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    length = len(my_list)
    print('length calculated in %.4f seconds' % (time.time() - start,))

    start = time.time()
    for loop in xrange(loops):
        my_list.insert(length / 2, random.read(element_size))
    print('inserting %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    for loop in xrange(loops):
        my_list[loop] = my_list[loop][1:] + my_list[loop][0]
    print('modifying %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    for loop in xrange(loops):
        del my_list[0]
    print('removing %s elements took %.4f seconds' % (loops, time.time() - start))

    start = time.time()
    transaction.commit()
    print('committing all above took %.4f seconds' % (time.time() - start,))

    del my_list[:loops]
    transaction.commit()

    start = time.time()
    pack()
    print('packing after removing %s elements took %.4f seconds' % (loops, time.time() - start))

# Remove any database files left over from a previous run.
for filename in glob.glob('database.db*'):
    try:
        os.unlink(filename)
    except OSError:
        pass

# cache_size is the number of objects (not bytes) ZODB keeps in its
# per-connection cache.
db = DB(FileStorage('database.db'),
        cache_size=2000)

def pack():
    db.pack()

root = db.open().root()

root['my_list'] = BList()

print('inserting initial data to blist')

for loop in xrange(10):
    root['my_list'].extend(random.read(100) for x in xrange(100000))
    transaction.commit()

transaction.commit()

test_list(root['my_list'])
最丧也最甜 2024-09-03 04:00:21

How about using Tokyo Cabinet? Very fast and simple, like lists but built for what you want.
http://1978th.net/tokyocabinet/
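
A minimal sketch of what that could look like, assuming the pytc binding (untested; the class, method, and flag names below mirror the Tokyo Cabinet C API and should be checked against whichever binding you install). A B-tree database keyed by zero-padded index keeps neighbouring indexes in nearby pages on disk:

import pytc

# Open (or create) a B-tree database file; BDBOWRITER | BDBOCREAT means
# open for writing and create the file if it is missing.
db = pytc.BDB()
db.open('biglist.tcb', pytc.BDBOWRITER | pytc.BDBOCREAT)

def key(index):
    # Fixed-width, zero-padded keys make lexicographic order match
    # numeric order, so nearby indexes sort next to each other.
    return '%012d' % index

db.put(key(5), 'hello')    # list-style write by index
print(db.get(key(5)))      # list-style read by index

db.close()

Note this only gives indexed access; inserting or popping in the middle of the "list" would still require renumbering keys (or a gapped-key scheme), so it doesn't cover the full list semantics by itself.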

箜明 2024-09-03 04:00:21

I think ZODB is the tool to use. It will store lots of arbitrary items, and it handles the memory issues.

Here is a working example; in this case I included objects that reference each other as well as being stored in a BTree by number.

import random
from collections import deque

import ZODB
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
import transaction
import persistent
import BTrees.OOBTree

def random_string(n=100):
    # A random printable ASCII string of length n.
    return ''.join([chr(random.randint(0, 95) + 32) for i in xrange(n)])


class Node(persistent.Persistent):
    def __init__(self, value=None):
        if value is None:
            value = random_string()
        self.value = value

    def setNeighbors(self, refs):
        self.p1 = refs[0]
        self.p2 = refs[1]
        self.p3 = refs[2]
        self.p4 = refs[3]


def getTree():
    storage = FileStorage('c:\\test.zdb')
    db = DB(storage)
    connection = db.open()
    root = connection.root()
    if 'tree' in root:
        tree = root['tree']
    else:
        tree = BTrees.OOBTree.OOBTree()
        root['tree'] = tree
        transaction.commit()
    return tree


def fillDB():
    tree = getTree()

    # Start with some initial objects.
    nodes = deque([Node(), Node(), Node(), Node()])
    node = Node()

    for n in xrange(20000):
        tree[n] = node           # store the node based on a numeric ID
        node.setNeighbors(nodes) # make the node refer to more nodes
        node = nodes.popleft()   # maintain our list of 4 upcoming nodes
        nodes.append(Node())
        if n % 1000 == 0:
            transaction.commit() # must commit for data to make it to disk
            print n
    transaction.commit()
    return tree

At this point the tree variable basically works like a dictionary and can be accessed by key. You can also get the keys in a range by using tree.keys(min, max), as described in the ZODB BTrees API documentation.
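
For instance, a range scan over the numeric keys might look like this (a small sketch building on the example above, assuming fillDB() has already been run):

tree = getTree()

# Only the buckets covering keys 1000..1999 are loaded from disk;
# both bounds are inclusive.
for n in tree.keys(1000, 1999):
    node = tree[n]
    # ... scan sequentially, e.g. read node.value ...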

You can store your 10 lists by putting each one under a different key in the root object returned by ZODB. The root object serves as the "gateway" to the ZODB object storage.
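
Something like the following could set that up (a sketch; the key names are arbitrary, and db is an open ZODB DB as in getTree() above):

root = db.open().root()

# One container per list, each under its own key in the root mapping.
for i in xrange(10):
    root['list_%d' % i] = BTrees.OOBTree.OOBTree()
transaction.commit()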

Thanks to ZODB, you can also use inter-object references as well as the BTree index. For example:

tree = getTree()

node1 = tree[1]
print node1.p1.p1.p1.p1.p1.p1.p1.p1.p1.value