Shelve is too slow for large dictionaries; what can I do to improve performance?

Posted on 2024-09-14 23:15:53

I am storing a table using Python, and I need persistence.

Essentially I am storing the table as a dictionary mapping strings to numbers, and the whole thing is stored with shelve:

self.DB = shelve.open("%s%sMoleculeLibrary.shelve" % (directory, os.sep), writeback=True)

I set writeback to True because I found the system tends to be unstable if I don't.

After the computations the system needs to close the database, and store it back. Now the database (the table) is about 540MB, and it is taking ages. The time exploded after the table grew to about 500MB. But I need a much bigger table. In fact I need two of them.

I am probably using the wrong form of persistence. What can I do to improve performance?

Comments (4)

黑白记忆 2024-09-21 23:15:53

For storing a large dictionary of string : number key-value pairs, I'd suggest a JSON-native storage solution such as MongoDB. It has a wonderful API for Python, Pymongo. MongoDB itself is lightweight and incredibly fast, and json objects will natively be dictionaries in Python. This means that you can use your string key as the object ID, allowing for compressed storage and quick lookup.

As an example of how easy the code would be, see the following:

d = {'string1': 1, 'string2': 2, 'string3': 3}
from pymongo import MongoClient  # Connection was removed in modern pymongo
client = MongoClient()
db = client['example-database']
collection = db['example-collection']
for string, num in d.items():
    # upsert so re-running the loop updates instead of failing on duplicates
    collection.replace_one({'_id': string}, {'_id': string, 'value': num}, upsert=True)
# testing
newD = {}
for obj in collection.find():
    newD[obj['_id']] = obj['value']
print(newD)
# prints the dict back, e.g. {'string1': 1, 'string2': 2, 'string3': 3}

In Python 3 the keys come back as ordinary str, so no conversion from unicode is needed.

多像笑话 2024-09-21 23:15:53

Based on my experience, I would recommend using SQLite3, which comes with Python. It works well with larger databases and large numbers of keys: millions of keys and gigabytes of data are not a problem, and at that scale shelve is entirely the wrong tool. Having a separate database process isn't beneficial either; it just costs extra context switches. In my tests I found that SQLite3 was the preferred option for handling larger data sets locally: running a local database engine like mongo, mysql or postgresql doesn't provide any additional value, and was also slower.
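To make that concrete, here is a minimal sketch of using sqlite3 as a string-to-number key-value store; the table and column names are illustrative, not from the original answer:

```python
import sqlite3

# Use a file path instead of ":memory:" for persistence on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS molecules (key TEXT PRIMARY KEY, value INTEGER)"
)

data = {"string1": 1, "string2": 2, "string3": 3}
conn.executemany(
    "INSERT OR REPLACE INTO molecules (key, value) VALUES (?, ?)",
    data.items(),
)
conn.commit()

# Lookups hit the primary-key index, so they stay fast as the table grows.
(value,) = conn.execute(
    "SELECT value FROM molecules WHERE key = ?", ("string2",)
).fetchone()
print(value)  # 2
```

Unlike shelve with writeback=True, nothing is cached in memory between commits, so close time does not grow with the number of entries touched.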

深爱不及久伴 2024-09-21 23:15:53

I think your problem is due to the fact that you use the writeback=True. The documentation says (emphasis is mine):

Because of Python semantics, a shelf cannot know when a mutable
persistent-dictionary entry is modified. By default modified objects
are written only when assigned to the shelf (see Example). If the
optional writeback parameter is set to True, all entries accessed are
also cached in memory, and written back on sync() and close(); this
can make it handier to mutate mutable entries in the persistent
dictionary, but, if many entries are accessed, it can consume vast
amounts of memory for the cache, and it can make the close operation
very slow since all accessed entries are written back
(there is no way
to determine which accessed entries are mutable, nor which ones were
actually mutated).

You could avoid using writeback=True and make sure the data is written only once (keeping in mind that in-place mutations to entries are lost unless you assign them back to the shelf).

If you believe this is not the right storage option (it's difficult to say without knowing how the data is structured), I suggest sqlite3, which is integrated in Python (thus very portable) and has very good performance. It's somewhat more complicated than a simple key-value store.

See other answers for alternatives.

吃→可爱长大的 2024-09-21 23:15:53

How much larger? What are the access patterns? What kinds of computation do you need to do on it?

Keep in mind that you are going to have some performance limits if you can't keep the table in memory no matter how you do it.

You may want to look at going to SQLAlchemy, or directly using something like bsddb, but both of those will sacrifice simplicity of code. However, with SQL you may be able to offload some of the work to the database layer depending on the workload.
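As a hedged illustration of offloading work to the database layer, SQLite can aggregate without ever loading the full table into Python; the table and column names below are made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent store
conn.execute("CREATE TABLE counts (key TEXT PRIMARY KEY, value INTEGER)")
conn.executemany(
    "INSERT INTO counts VALUES (?, ?)",
    [("a", 10), ("b", 20), ("c", 30)],
)

# The aggregation runs inside the database engine, so only one number
# crosses into Python rather than the whole table.
(total,) = conn.execute("SELECT SUM(value) FROM counts").fetchone()
print(total)  # 60
```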
