Minimising disk reads and writes in Python for a memory-heavy operation
Background
I am working on a fairly computationally intensive task for a computational linguistics project, but the problem I have is quite general and hence I expect that a solution would be interesting to others as well.
Requirements
The key aspect of this particular program I must write is that it must:
- Read through a large corpus (between 5G and 30G, and potentially larger stuff down the line)
- Process the data on each line.
- From this processed data, construct a large number of vectors (dimensionality of some of these vectors is > 4,000,000). Typically it is building hundreds of thousands of such vectors.
- These vectors must all be saved to disk in some format or other.
Steps 1 and 2 are not hard to do efficiently: just use generators and have a data-analysis pipeline. The big problem is operation 3 (and, by connection, 4).
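As an illustration of what is meant by steps 1 and 2, a minimal sketch of such a pipeline; the file name and the per-line processing are placeholders, not part of the original setup:

```python
def lines(path):
    """Stream the corpus line by line without loading it all into memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

def processed(line_iter):
    """Stand-in for the real per-line processing step."""
    for line in line_iter:
        yield line.split()

for tokens in processed(lines("corpus.txt")):
    pass  # hand each processed line to the vector-building step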
Parenthesis: Technical Details
In case the actual procedure for building vectors affects the solution:
For each line in the corpus, one or more vectors must have their basis weights updated.
If you think of them in terms of python lists, each line, when processed, updates one or more lists (creating them if needed) by incrementing the values of these lists at one or more indices by a value (which may differ based on the index).
Vectors do not depend on each other, nor does it matter which order the corpus lines are read in.
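A minimal sketch of that update rule, assuming dense NumPy arrays created lazily per entity; `extract_updates` is a hypothetical stand-in for whatever maps a processed line to (entity, index, increment) triples:

```python
import numpy as np
from collections import defaultdict

DIM = 4_000_000  # dimensionality mentioned above

# one vector per entity, created the first time that entity is seen
vectors = defaultdict(lambda: np.zeros(DIM, dtype=np.float32))

def update_from_line(processed_line):
    # extract_updates() is hypothetical: it yields (entity, index, increment)
    for entity, index, increment in extract_updates(processed_line):
        vectors[entity][index] += increment
```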
Attempted Solutions
There are three extrema when it comes to how to do this:
- I could build all the vectors in memory. Then write them to disk.
- I could build all the vectors directly on the disk, using shelve, pickle, or some such library.
- I could build the vectors in memory one at a time and write each to disk, passing through the corpus once per vector.
All these options are fairly intractable. Option 1 just uses up all the system memory, and the machine panics and slows to a crawl. Option 2 is way too slow, as IO operations aren't fast. Option 3 is possibly even slower than 2, for the same reasons.
Goals
A good solution would involve:
- Building as much as possible in memory.
- Once memory is full, dump everything to disk (a sketch of this step follows the list).
- If bits are needed from disk again, recover them back into memory to add stuff to those vectors.
- Go back to 1 until all vectors are built.
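A rough sketch of the dump step in that cycle, assuming one `.npy` file per vector; the directory layout and names are illustrative only:

```python
import os
import numpy as np

CACHE_DIR = "vector_cache"

def dump_all(vectors):
    """Write every in-memory vector to disk, merging with earlier dumps."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    for key, vec in vectors.items():
        path = os.path.join(CACHE_DIR, f"{key}.npy")
        if os.path.exists(path):
            vec = vec + np.load(path)  # recover the earlier partial vector
        np.save(path, vec)
    vectors.clear()  # free the RAM for the next batch
```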
The problem is that I'm not really sure how to go about this. It seems somewhat unpythonic to worry about system attributes such as RAM, but I don't see how this sort of problem can be optimally solved without taking this into account. As a result, I don't really know how to get started on this sort of thing.
Question
Does anyone know how to go about solving this sort of problem? Is Python simply not the right language for this sort of thing? Or is there a simple solution to maximise how much is done from memory (within reason) while minimising how many times data must be read from the disk, or written to it?
Many thanks for your attention. I look forward to seeing what the bright minds of stackoverflow can throw my way.
Additional Details
The sort of machine this problem is run on usually has 20+ cores and ~70G of RAM. The problem can be parallelised (à la MapReduce) in that separate vectors for one entity can be built from segments of the corpus and then added to obtain the vector that would have been built from the whole corpus.
Part of the question involves determining a limit on how much can be built in memory before disk-writes need to occur. Does python offer any mechanism to determine how much RAM is available?
Answers (11)
Take a look at pytables. One of the advantages is you can work with very large amounts of data, stored on disk, as if it were in memory.
Edit: Because the I/O performance will be a bottleneck (if not THE bottleneck), you will want to consider SSD technology: high I/O per second and virtually no seek times. The size of your project is perfect for today's affordable SSD 'drives'.
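For illustration, a minimal sketch of keeping the vectors in an on-disk PyTables CArray; the file name, node name, sizes and compression settings are assumptions:

```python
import numpy as np
import tables

N_VECTORS, DIM = 100_000, 4_000_000  # rough sizes from the question

h5 = tables.open_file("vectors.h5", mode="w")
vectors = h5.create_carray(
    h5.root, "vectors",
    atom=tables.Float32Atom(),
    shape=(N_VECTORS, DIM),
    filters=tables.Filters(complevel=5, complib="blosc"),  # chunked + compressed
)

# rows can be written and read back individually, much like a numpy array
vectors[0, :] = np.zeros(DIM, dtype=np.float32)
row = vectors[0, :]
h5.close()
```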
A couple libraries come to mind which you might want to evaluate:
joblib - Makes parallel computation easy, and provides transparent disk-caching of output and lazy re-evaluation (see the sketch after this list).
mrjob - Makes it easy to write Hadoop streaming jobs on Amazon Elastic MapReduce or your own Hadoop cluster.
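A small sketch of the joblib side of this; the cache directory, chunk paths and the chunk-processing function are illustrative:

```python
from joblib import Memory, Parallel, delayed

memory = Memory("./joblib_cache", verbose=0)

@memory.cache  # results are cached on disk and only recomputed when inputs change
def build_partial_vectors(chunk_path):
    partial = {}  # entity -> vector, built from one chunk of the corpus
    ...           # the real per-line processing would go here
    return partial

chunk_paths = ["chunk_00.txt", "chunk_01.txt"]
partials = Parallel(n_jobs=-1)(
    delayed(build_partial_vectors)(p) for p in chunk_paths
)
```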
Two ideas:
Use numpy arrays to represent vectors. They are much more memory-efficient, at the cost that they will force elements of the vector to be of the same type (all ints or all doubles...).
Do multiple passes, each with a different set of vectors. That is, choose the first 1M vectors and do only the calculations involving them (you said they are independent, so I assume this is viable). Then do another pass over all the data with the second 1M vectors (see the sketch at the end of this answer).
It seems you're on the edge of what you can do with your hardware. It would help if you could describe what hardware (mostly, RAM) is available to you for this task. If there are 100k vectors, each with 1M ints, this gives ~370GB. If the multiple-passes method is viable and you've got a machine with 16GB RAM, then it's about 25 passes -- which should be easy to parallelize if you've got a cluster.
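A sketch of the multi-pass idea, where only one slice of entities is materialised per pass over the corpus; `extract_updates` is a hypothetical parser of a processed line:

```python
import numpy as np

def build_pass(corpus_path, entities_this_pass, dim=4_000_000):
    vectors = {e: np.zeros(dim, dtype=np.float32) for e in entities_this_pass}
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for entity, index, increment in extract_updates(line):  # hypothetical
                if entity in vectors:  # entities outside this pass are skipped
                    vectors[entity][index] += increment
    return vectors  # write these to disk, then move on to the next slice
```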
Think about using an existing in-memory DB solution like Redis. The problem of spilling to disk once RAM runs out, and the tricks to tweak that process, should already be in place there. There is a Python client as well.
Moreover, this solution could scale vertically without much effort.
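A tiny sketch of what that could look like with redis-py, storing each (sparse) vector as a Redis hash keyed by index; the key names are illustrative:

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def bump(entity, index, increment):
    # HINCRBYFLOAT is atomic; Redis handles persistence and memory policy
    r.hincrbyfloat(f"vec:{entity}", index, increment)

def load_vector(entity):
    raw = r.hgetall(f"vec:{entity}")
    return {int(k): float(v) for k, v in raw.items()}
```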
You didn't mention either way, but if you're not, you should use NumPy arrays for your lists rather than native Python lists, which should help speed things up and reduce memory usage, as well as making whatever math you're doing faster and easier.
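For instance, NumPy's unbuffered add applies a whole batch of increments to a vector in a single call, with repeated indices accumulating correctly; the indices and values below are made up:

```python
import numpy as np

vec = np.zeros(4_000_000, dtype=np.float32)
indices = np.array([3, 17, 3, 1_000_000])                     # 3 appears twice
increments = np.array([1.0, 0.5, 2.0, 1.0], dtype=np.float32)

np.add.at(vec, indices, increments)  # vec[3] == 3.0 afterwards
```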
If you're at all familiar with C/C++, you might also look into Cython, which lets you write some or all of your code in C, which is much faster than Python, and integrates well with NumPy arrays. You might want to profile your code to find out which spots are taking the most time, and write those sections in C.
It's hard to say what the best approach will be, but of course any speedups you can make in critical parts of your code will help. Also keep in mind that once RAM is exhausted, your program will start running in virtual memory on disk, which will probably cause far more disk I/O activity than the program itself, so if you're concerned about disk I/O, your best bet is probably to make sure that the batch of data you're working on in memory doesn't get much larger than available RAM.
Use a database. That problem seems large enough that language choice (Python, Perl, Java, etc) won't make a difference. If each dimension of the vector is a column in the table, adding some indexes is probably a good idea. In any case this is a lot of data and won't process terribly quickly.
I'd suggest doing it this way:
1) Construct the easy pipeline you mentioned
2) Construct your vectors in memory and "flush" them into a DB (Redis and MongoDB are good candidates; a MongoDB sketch follows at the end of this answer)
3) Determine how much memory this procedure consumes and parallelize accordingly (or, even better, use a map/reduce approach, or a distributed task queue like celery)
Plus all the tips mentioned before (NumPy etc.)
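A rough sketch of the "flush into a DB" step using MongoDB's $inc upserts; the database, collection and field names are illustrative (the Redis answer above shows the other candidate):

```python
from pymongo import MongoClient, UpdateOne

coll = MongoClient()["corpus"]["vectors"]

def flush(in_memory):
    """in_memory maps entity -> {index: value} for the current batch."""
    ops = []
    for entity, sparse_vec in in_memory.items():
        inc = {f"w.{i}": float(v) for i, v in sparse_vec.items()}
        ops.append(UpdateOne({"_id": entity}, {"$inc": inc}, upsert=True))
    if ops:
        coll.bulk_write(ops, ordered=False)
    in_memory.clear()  # start the next in-memory batch from scratch
```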
Hard to say exactly because there are a few details missing, e.g. is this a dedicated box? Does the process run on several machines? Does the available memory change?
In general I recommend not reimplementing the job of the operating system.
Note this next paragraph doesn't seem to apply since the whole file is read each time:
I'd test implementation three, giving it a healthy disk cache, and see what happens. With plenty of cache, performance might not be as bad as you'd expect.
You'll also want to cache expensive calculations that will be needed soon. In short, when an expensive operation is calculated that can be used again, you store it in a dictionary (or perhaps disk, memcached, etc), and then look there first before calculating again. The Django docs have a good introduction.
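The "store it in a dictionary and look there first" pattern is essentially memoisation; in pure Python it can be as simple as the following sketch, where the function is a placeholder:

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def expensive_feature(token):
    # placeholder for a costly computation whose result is reused across lines
    return len(token)
```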
From another comment I infer that your corpus fits into the memory, and you have some cores to throw at the problem, so I would try this:
Have a smallish shell script monitor RAM usage, and spawn another process of the following every second, as long as there is x memory left (or, if you want to make things a bit more complex, y I/O bandwidth to disk):
In the end you can collect and combine all vectors, if needed (this would be the reduce part).
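The same idea expressed as a Python sketch rather than a shell script: keep spawning workers while enough memory is free. Here `worker.py`, the segment list and the 4 GB threshold are assumptions:

```python
import subprocess
import time
import psutil

MIN_FREE_BYTES = 4 * 1024**3  # stop spawning new workers below ~4 GB free
segments = ["seg_00.txt", "seg_01.txt", "seg_02.txt"]  # pieces of the corpus
workers = []

while segments:
    if psutil.virtual_memory().available > MIN_FREE_BYTES:
        seg = segments.pop()
        workers.append(subprocess.Popen(["python", "worker.py", seg]))
    time.sleep(1)

for w in workers:
    w.wait()  # then collect and combine the partial vectors
```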
Split the corpus evenly in size between parallel jobs (one per core) - process in parallel, ignoring any incomplete line (or, if you cannot tell whether it is incomplete, ignore the first and last line that each job processes).
That's the map part.
Use one job to merge the 20+ sets of vectors from each of the earlier jobs - That's the reduce step.
You stand to lose information from 2*N lines, where N is the number of parallel processes, but you gain by not adding complicated logic to try and capture these lines for processing.
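An illustrative sketch of that split/merge layout with multiprocessing; `build_vectors_for_range`, the map step over one byte range of the corpus, is a hypothetical helper:

```python
import os
from multiprocessing import Pool

def map_job(args):
    path, start, end = args
    return build_vectors_for_range(path, start, end)  # hypothetical map step

def reduce_job(partials):
    merged = {}
    for partial in partials:
        for entity, vec in partial.items():
            merged[entity] = merged.get(entity, 0) + vec  # vectors simply add
    return merged

def run(path, n_jobs=20):
    size = os.path.getsize(path)
    bounds = [(path, i * size // n_jobs, (i + 1) * size // n_jobs)
              for i in range(n_jobs)]
    with Pool(n_jobs) as pool:
        return reduce_job(pool.map(map_job, bounds))
```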
Many of the methods discussed by others on this page are very helpful, and I recommend that anyone else needing to solve this sort of problem look at them.
One of the crucial aspects of this problem is deciding when to stop building vectors (or whatever you're building) in memory and dump stuff to disk. This requires a (pythonesque) way of determining how much memory one has left.
It turns out that the psutil python module does just the trick.
For example, say I want to have a while-loop that adds stuff to a Queue for other processes to deal with until my RAM is 80% full. The following pseudocode will do the trick:
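A reconstructed sketch of that pseudocode, with the loop condition, the queue and the helper functions as placeholders:

```python
import psutil

while corpus_lines_remain():                    # placeholder loop condition
    if psutil.virtual_memory().percent < 80.0:
        work_queue.put(build_next_vector())     # keep filling the queue
    else:
        dump_queue_to_disk(work_queue)          # flush and free memory first
```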
This way you can have one process tracking memory usage and deciding that it's time to write to disk and free up some system memory (deciding which vectors to cache is a separate problem).
PS. Props to Sean for suggesting this module.