Memory issues: if I'm using swap, should I be writing to a file/database? (Python)

Posted 2024-09-06 23:00:52


I'm creating and processing a very large data set, with about 34 million data points, and I'm currently storing them in python dictionaries in memory (about 22,500 dictionaries, with 15 dictionaries in each of 1588 class instances). While I'm able to manage this all in memory, I'm using up all of my RAM and most of my swap.

I need to be able to first generate all of this data, and then do analysis on select portions of it at a time. Would it be beneficial from an efficiency standpoint to write some of this data to file, or store it in a database? Or am I better off just taking the hit to efficiency that comes with using my swap space. If I should be writing to file/a database, are there any python tools that you would recommend to do so?


Comments (3)

一杆小烟枪 2024-09-13 23:00:52


Get a relational database, fast! Or a whole lot more RAM.

If you're using Python, then start with Python Database Programming. SQLite would be a choice, but I'd suggest MySQL based upon the amount of data you're dealing with. If you want an object-oriented approach to storing your data, you might want to look at SQLAlchemy, but you'll probably get more efficiency if you end up mapping each of your object classes to a table yourself and just coping with rows and columns.
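For reference, a minimal sketch of the SQLAlchemy route this answer mentions, assuming the data can be flattened into rows; the table and column names here are invented for illustration, and the SQLite URL can be swapped for a MySQL one (e.g. a `mysql+mysqldb://...` URL) if you follow the MySQL suggestion:

```python
# Hypothetical mapping of one data-point record to a table with SQLAlchemy's
# declarative ORM. SQLite is used only because it needs no server to try out.
from sqlalchemy import create_engine, Column, Integer, Float, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class DataPoint(Base):
    __tablename__ = "data_points"
    id = Column(Integer, primary_key=True)
    instance_id = Column(Integer, index=True)   # which class instance it came from
    key = Column(String)                        # original dictionary key
    value = Column(Float)

engine = create_engine("sqlite:///data_points.db")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Generation phase: stream rows into the database instead of holding them in RAM.
    session.add_all(DataPoint(instance_id=1, key="x", value=v)
                    for v in (1.0, 2.0, 3.0))
    session.commit()

    # Analysis phase: pull back only the portion you want to work on.
    subset = session.query(DataPoint).filter_by(instance_id=1).all()
```

As the answer notes, mapping your classes to tables by hand and working with plain rows and columns would avoid the ORM overhead; the sketch above just shows the object-oriented option.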

岁月无声 2024-09-13 23:00:52


Because you will be looking at "select portions", your application will be able to make better use of core than Virtual Memory will. VM is convenient, but - by definition - kinda stupid about locality of reference.

Use a database.

I'd probably start with the sqlite3 module on the basis of simplicity, unless or until I find that it is a bottleneck.
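A minimal sketch of that sqlite3 suggestion, assuming the data points can be stored as simple rows; the table, column, and file names are made up for illustration:

```python
# Generate everything into an on-disk table first, then query back only the
# slice you want to analyse, so the working set stays small.
import sqlite3

conn = sqlite3.connect("points.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS points (instance_id INTEGER, key TEXT, value REAL)"
)
conn.execute("CREATE INDEX IF NOT EXISTS idx_instance ON points (instance_id)")

# Generation phase: stream rows to disk instead of keeping 34M points in dicts.
rows = ((i, "k%d" % j, float(i * j)) for i in range(10) for j in range(15))
conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows)
conn.commit()

# Analysis phase: the index lets SQLite read in just the selected portion,
# which is exactly the locality-of-reference win over swapping.
for key, value in conn.execute(
    "SELECT key, value FROM points WHERE instance_id = ?", (3,)
):
    pass  # analyse this instance's data here

conn.close()
```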

离不开的别离 2024-09-13 23:00:52


If you have this data in Python data structures already, assuming you're not doing a lot of in-memory indexing (more than the obvious dictionary keys index), you really don't want to use a relational database - you'll pay a considerable performance penalty for no particular benefit.

You just need to get your already key-value-pair data out of memory, not change its format. You should look into key-value stores like BDB, Voldemort, MongoDB, or Scalaris (just to name a few - some more involved and functional than others, but all should easily handle your dataset), or for a dataset that you think might grow even larger or more complex you can look into systems like Cassandra, Riak, or CouchDB (among others). ALL of these systems will offer you vastly superior performance to a relational database and more directly map to an in-memory data model.

All that being said, of course, if your dataset really could be more performant by leveraging the benefits of a relational database (complex relationships, multiple views, etc.), then go for it, but you shouldn't use a relational database if all you're trying to do is get your data structures out of memory.

(It's also possible that just marshaling/pickling your data in segments and managing it yourself would offer better performance than a relational database, assuming your access pattern made paging in/out a relatively infrequent event. It's a long shot, but if you're just holding old data around and no one really looks at it, you might as well just throw that to disk yourself.)
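To illustrate the "get the key-value data out of memory without changing its format" idea without installing anything, here is a minimal sketch using the standard-library shelve module (a dbm-backed pickle store); the file name and per-instance keying scheme are assumptions for the example, not part of the answer:

```python
# Each per-instance dictionary is stored under its own key, so the analysis
# phase only unpickles the portion it actually touches.
import shelve

with shelve.open("points.shelf") as store:
    # Generation phase: write each instance's dictionary to disk as you build it.
    for instance_id in range(1588):
        store["instance-%d" % instance_id] = {
            "k%d" % j: instance_id * j for j in range(15)
        }

with shelve.open("points.shelf", flag="r") as store:
    # Analysis phase: load just one instance's data back into memory.
    one_instance = store["instance-42"]
```

The same pattern carries over to the dedicated key-value stores named above; they mainly add better concurrency, durability, and scaling than a local shelf file.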
