Minimising disk reads and writes in Python for a memory-heavy operation
Background
I am working on a fairly computationally intensive task for a computational linguistics project, but the problem I have is quite general and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
- Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
- Process the data on each line.
- From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
- These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is operation 3 (and, by connection, 4).
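As an illustration of what is meant by steps 1 and 2, a minimal sketch of such a pipeline; the file name and the per-line processing are placeholders, not part of the original setup:

```python
def lines(path):
    """Stream the corpus line by line without loading it all into memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

def processed(line_iter):
    """Stand-in for the real per-line processing step."""
    for line in line_iter:
        yield line.split()

for tokens in processed(lines("corpus.txt")):
    pass  # hand each processed line to the vector-building step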
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have their basis weights updated.
If you think of them in terms of python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index).
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
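A minimal sketch of that update rule, assuming dense NumPy arrays created lazily per entity; `extract_updates` is a hypothetical stand-in for whatever maps a processed line to (entity, index, increment) triples:

```python
import numpy as np
from collections import defaultdict

DIM = 4_000_000  # dimensionality mentioned above

# one vector per entity, created the first time that entity is seen
vectors = defaultdict(lambda: np.zeros(DIM, dtype=np.float32))

def update_from_line(processed_line):
    # extract_updates() is hypothetical: it yields (entity, index, increment)
    for entity, index, increment in extract_updates(processed_line):
        vectors[entity][index] += increment
```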
Attempted Solutions
There are three extrema when it comes to how to do this:
- I could build all the vectors in memory. Then write them to disk.
- I could build all the vectors directly on the disk, using shelve, pickle, or some such library.
- I could build the vectors in memory one at a time and write each to disk, passing through the corpus once per vector.
All these options are fairly intractable. Option 1 just uses up all the system memory, and the machine panics and slows to a crawl. Option 2 is way too slow, as IO operations aren't fast. Option 3 is possibly even slower than 2, for the same reasons.
Goals
A good solution would involve:
- Building as much as possible in memory.
- Once memory is full, dump everything to disk (a sketch of this step follows the list).
- If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
- Go back to 1 until all vectors are built.
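A rough sketch of the dump step in that cycle, assuming one `.npy` file per vector; the directory layout and names are illustrative only:

```python
import os
import numpy as np

CACHE_DIR = "vector_cache"

def dump_all(vectors):
    """Write every in-memory vector to disk, merging with earlier dumps."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    for key, vec in vectors.items():
        path = os.path.join(CACHE_DIR, f"{key}.npy")
        if os.path.exists(path):
            vec = vec + np.load(path)  # recover the earlier partial vector
        np.save(path, vec)
    vectors.clear()  # free the RAM for the next batch
```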
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?
Answers (11)
Take a look at pytables. One of the advantages is you can work with very large amounts of data, stored on disk, as if it were in memory.
Edit: Because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seek times. The size of your project is perfect for today's affordable SSD 'drives'.
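For illustration, a minimal sketch of keeping the vectors in an on-disk PyTables CArray; the file name, node name, sizes and compression settings are assumptions:

```python
import numpy as np
import tables

N_VECTORS, DIM = 100_000, 4_000_000  # rough sizes from the question

h5 = tables.open_file("vectors.h5", mode="w")
vectors = h5.create_carray(
    h5.root, "vectors",
    atom=tables.Float32Atom(),
    shape=(N_VECTORS, DIM),
    filters=tables.Filters(complevel=5, complib="blosc"),  # chunked + compressed
)

# rows can be written and read back individually, much like a numpy array
vectors[0, :] = np.zeros(DIM, dtype=np.float32)
row = vectors[0, :]
h5.close()
```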
A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation (see the sketch after this list).
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.
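A small sketch of the joblib side of this; the cache directory, chunk paths and the chunk-processing function are illustrative:

```python
from joblib import Memory, Parallel, delayed

memory = Memory("./joblib_cache", verbose=0)

@memory.cache  # results are cached on disk and only recomputed when inputs change
def build_partial_vectors(chunk_path):
    partial = {}  # entity -> vector, built from one chunk of the corpus
    ...           # the real per-line processing would go here
    return partial

chunk_paths = ["chunk_00.txt", "chunk_01.txt"]
partials = Parallel(n_jobs=-1)(
    delayed(build_partial_vectors)(p) for p in chunk_paths
)
```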
Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose the first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then do another pass over all the data with the second 1M vectors (see the sketch at the end of this answer).
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly, RAM) is available to you for this task. If there are 100k vectors, each with 1M ints, this gives ~370GB. If the multiple-passes method is viable and you've got a machine with 16GB RAM, then it's about 25 passes -- which should be easy to parallelize if you've got a cluster.
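A sketch of the multi-pass idea, where only one slice of entities is materialised per pass over the corpus; `extract_updates` is a hypothetical parser of a processed line:

```python
import numpy as np

def build_pass(corpus_path, entities_this_pass, dim=4_000_000):
    vectors = {e: np.zeros(dim, dtype=np.float32) for e in entities_this_pass}
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for entity, index, increment in extract_updates(line):  # hypothetical
                if entity in vectors:  # entities outside this pass are skipped
                    vectors[entity][index] += increment
    return vectors  # write these to disk, then move on to the next slice
```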
Think about using an existing in-memory DB solution like Redis. The problem of spilling to disk once RAM runs out, and the tricks to tweak that process, should already be in place there. There is a Python client as well.
Moreover, this solution could scale vertically without much effort.
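A tiny sketch of what that could look like with redis-py, storing each (sparse) vector as a Redis hash keyed by index; the key names are illustrative:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def bump(entity, index, increment):
    # HINCRBYFLOAT is atomic; Redis handles persistence and memory policy
    r.hincrbyfloat(f"vec:{entity}", index, increment)

def load_vector(entity):
    raw = r.hgetall(f"vec:{entity}")
    return {int(k): float(v) for k, v in raw.items()}
```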
You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
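For instance, NumPy's unbuffered add applies a whole batch of increments to a vector in a single call, with repeated indices accumulating correctly; the indices and values below are made up:

```python
import numpy as np

vec = np.zeros(4_000_000, dtype=np.float32)
indices = np.array([3, 17, 3, 1_000_000])                     # 3 appears twice
increments = np.array([1.0, 0.5, 2.0, 1.0], dtype=np.float32)

np.add.at(vec, indices, increments)  # vec[3] == 3.0 afterwards
```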
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of your code will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much larger than available RAM.
Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.
I'd suggest doing it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB (Redis and MongoDB are good candidates; a MongoDB sketch follows at the end of this answer)
3) Determine how much memory this procedure consumes and parallelize accordingly (or, even better, use a map/reduce approach, or a distributed task queue like celery)
Plus all the tips mentioned before (NumPy etc.)
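A rough sketch of the "flush into a DB" step using MongoDB's $inc upserts; the database, collection and field names are illustrative (the Redis answer above shows the other candidate):

```python
from pymongo import MongoClient, UpdateOne

coll = MongoClient()["corpus"]["vectors"]

def flush(in_memory):
    """in_memory maps entity -> {index: value} for the current batch."""
    ops = []
    for entity, sparse_vec in in_memory.items():
        inc = {f"w.{i}": float(v) for i, v in sparse_vec.items()}
        ops.append(UpdateOne({"_id": entity}, {"$inc": inc}, upsert=True))
    if ops:
        coll.bulk_write(ops, ordered=False)
    in_memory.clear()  # start the next in-memory batch from scratch
```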
Hard to say exactly because there are a few details missing, e.g. is this a dedicated box? Does the process run on several machines? Does the available memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache, and see what happens. With plenty of cache, performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps disk, memcached, etc), and then look there first before calculating again. The Django docs have a good introduction.
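The "store it in a dictionary and look there first" pattern is essentially memoisation; in pure Python it can be as simple as the following sketch, where the function is a placeholder:

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def expensive_feature(token):
    # placeholder for a costly computation whose result is reused across lines
    return len(token)
```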
From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Have a smallish shell script monitor RAM usage, and spawn another process of the following every second, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwidth to disk):
In the end you can collect and combine all vectors, if needed (this would be the reduce part).
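The same idea expressed as a Python sketch rather than a shell script: keep spawning workers while enough memory is free. Here `worker.py`, the segment list and the 4 GB threshold are assumptions:

```python
import subprocess
import time
import psutil

MIN_FREE_BYTES = 4 * 1024**3  # stop spawning new workers below ~4 GB free
segments = ["seg_00.txt", "seg_01.txt", "seg_02.txt"]  # pieces of the corpus
workers = []

while segments:
    if psutil.virtual_memory().available > MIN_FREE_BYTES:
        seg = segments.pop()
        workers.append(subprocess.Popen(["python", "worker.py", seg]))
    time.sleep(1)

for w in workers:
    w.wait()  # then collect and combine the partial vectors
```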
Split the corpus evenly in size between parallel jobs (one per core) - process in parallel, ignoring any incomplete line (or, if you cannot tell whether it is incomplete, ignore the first and last line that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to lose information from 2*N lines, where N is the number of parallel processes, but you gain by not adding complicated logic to try and capture these lines for processing.
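An illustrative sketch of that split/merge layout with multiprocessing; `build_vectors_for_range`, the map step over one byte range of the corpus, is a hypothetical helper:

```python
import os
from multiprocessing import Pool

def map_job(args):
    path, start, end = args
    return build_vectors_for_range(path, start, end)  # hypothetical map step

def reduce_job(partials):
    merged = {}
    for partial in partials:
        for entity, vec in partial.items():
            merged[entity] = merged.get(entity, 0) + vec  # vectors simply add
    return merged

def run(path, n_jobs=20):
    size = os.path.getsize(path)
    bounds = [(path, i * size // n_jobs, (i + 1) * size // n_jobs)
              for i in range(n_jobs)]
    with Pool(n_jobs) as pool:
        return reduce_job(pool.map(map_job, bounds))
```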
Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example, say I want to have a while-loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The following pseudocode will do the trick:
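A reconstructed sketch of that pseudocode, with the loop condition, the queue and the helper functions as placeholders:

```python
import psutil

while corpus_lines_remain():                    # placeholder loop condition
    if psutil.virtual_memory().percent < 80.0:
        work_queue.put(build_next_vector())     # keep filling the queue
    else:
        dump_queue_to_disk(work_queue)          # flush and free memory first
```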
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS. Props to Sean for suggesting this module.