Fastest way to save and load a large dictionary in Python
I have a relatively large dictionary. How do I know its size? Well, when I save it using cPickle, the file grows to approximately 400 MB. cPickle is supposed to be much faster than pickle, but loading and saving this file still takes a lot of time. I have a dual-core 2.6 GHz laptop with 4 GB RAM running Linux. Does anyone have any suggestions for faster saving and loading of dictionaries in Python? Thanks.
6 Answers
Use the protocol=2 option of cPickle. The default protocol (0) is much slower, and produces much larger files on disk.
If you just want to work with a larger dictionary than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict, but stores itself on disk rather than in memory. shelve is based on cPickle, so be sure to set your protocol to anything other than 0.
The advantages of a database like sqlite over cPickle will depend on your use case. How often will you write data? How many times do you expect to read each datum that you write? Will you ever want to perform a search of the data you write, or load it one piece at a time?
If you're doing write-once, read-many, and loading one piece at a time, by all means use a database. If you're doing write once, read once, cPickle (with any protocol other than the default protocol=0) will be hard to beat. If you just want a large, persistent dict, use shelve.
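A minimal sketch of the protocol=2 advice above, using the `pickle` module (on Python 2, `import cPickle as pickle` is the drop-in faster equivalent); the dictionary contents and filename here are just placeholders:

```python
import pickle  # on Python 2: import cPickle as pickle

# Stand-in for the large dictionary from the question
data = {i: str(i) for i in range(1000)}

# protocol=2 is binary and much faster/smaller than the default protocol 0
with open("data.pkl", "wb") as f:
    pickle.dump(data, f, protocol=2)

with open("data.pkl", "rb") as f:
    loaded = pickle.load(f)
```

The same `protocol` argument can be passed to `shelve.open(filename, protocol=2)` to get the binary format there as well.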
I know it's an old question, but just as an update for those who are still looking for an answer to it:
The protocol argument has been updated in Python 3, and there are now even faster and more efficient options (i.e. protocol=3 and protocol=4), which might not work under Python 2. You can read more about it in the reference.

In order to always use the best protocol supported by the Python version you're using, you can simply use pickle.HIGHEST_PROTOCOL. The following example is taken from the reference:
Sqlite
It might be worthwhile to store the data in a Sqlite database. Although there will be some development overhead when refactoring your program to work with Sqlite, it also becomes much easier and more performant to query the database.
You also get transactions, atomicity, serialization, compression, etc. for free.
Depending on what version of Python you're using, you might already have sqlite built-in.
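To make the trade-off concrete, here is a sketch of using the standard-library `sqlite3` module as a simple key-value store; the table name, keys, and values are made up for illustration:

```python
import sqlite3

# ":memory:" keeps the example self-contained; use a file path for persistence
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kv (key TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO kv VALUES (?, ?)",
    [("alpha", "1"), ("beta", "2"), ("gamma", "3")],
)
conn.commit()

# Unlike pickle, you can load one piece at a time instead of the whole dict
row = conn.execute("SELECT value FROM kv WHERE key = ?", ("beta",)).fetchone()
value = row[0]
```

This is exactly the write-once/read-many, load-one-piece-at-a-time pattern the earlier answer recommends a database for.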
I have tried this in many projects and concluded that shelve is faster than pickle at saving data. Both perform the same at loading data.

Shelve is, in fact, a dirty solution. That is because you have to be very careful with it. If you do not close a shelve file after opening it, or if for any reason your code is interrupted between opening and closing it, the shelve file has a high chance of getting corrupted (resulting in frustrating KeyErrors); which is really annoying, given that those of us using shelve are interested in it precisely for storing LARGE dict files, which clearly also took a long time to construct. And that is why shelve is a dirty solution... It's still faster, though. So!
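One way to guard against the corruption described above is to open the shelf in a `with` block (supported since Python 3.4), which guarantees the file is closed even if the code is interrupted; the filename and keys below are placeholders:

```python
import shelve

# The `with` block closes the shelf even if an exception occurs mid-write,
# which is the main defence against the corruption described above
with shelve.open("bigdict", protocol=2) as db:
    db["alpha"] = [1, 2, 3]
    db["beta"] = {"nested": True}

# Reading back one key at a time; the whole dict never has to fit in memory
with shelve.open("bigdict") as db:
    alpha = db["alpha"]
```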
You may try compressing your dictionary (with some restrictions, see: this post); it will be efficient if disk access is the bottleneck.
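A minimal sketch of the compression idea, combining `gzip` and `pickle` from the standard library (the trade-off is extra CPU for less disk I/O, so it only wins when the disk is the bottleneck):

```python
import gzip
import pickle

# Stand-in for the large, highly compressible dictionary
data = {i: "x" * 10 for i in range(100)}

# Pickling through a gzip stream shrinks what hits the disk
with gzip.open("data.pkl.gz", "wb") as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

with gzip.open("data.pkl.gz", "rb") as f:
    restored = pickle.load(f)
```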
That is a lot of data...

What kind of contents does your dictionary have? If it is only primitive or fixed datatypes, maybe a real database or a custom file format is the better option?