累积数万亿个值的分组总和
我有一个数据缩减问题,事实证明该问题很难解决。
本质上,我有一个程序,可以从总共约 6000 万个键的集合中计算键对的增量值(浮点)。该程序将“相对”快速地生成大约 53 万亿对的值(简单地迭代这些值大约需要三天时间
I have a data reduction issue that is proving to be very difficult to solve.
Essentially, I have a program that calculates incremental values (floating point) for pairs of keys from a set of about 60 million keys total. The program will generate values for about 53 trillion pairs 'relatively' quickly (simply iterating through the values would take about three days ????). Not every pair of keys will occur, and many pairs will come up many times. There is no reasonable way to have the pairs come up in a particular order. What I need is a way to find the sum of the values generated for each pair of keys.
For data that would fit in memory, this is a very simple problem. In python it would look something like:
from collections import Counter
res = Counter()
for key1,key2,val in data_generator():
res[(key1,key2)] += val
The problem, of course, is that a mapping like that won't fit in memory. So I'm looking for a way to do this efficiently with a mix of on-disk and in-memory processing.
So far I've tried:
- A postgresql table with upserts (
ON CONFLICT UPDATE
). This turned out to be far, far to slow. - A hybrid of in-memory dictionaries in python that write to a RocksDB or LMDB key value store when they get too big. Though these DBs are much faster than postgresql for this kind of task, the time to complete is still on the order of months.
At this point, I'm hoping someone has a better approach that I could try. Is there a way to break this problem up into smaller parts? Is there a standard MapReduce approach to this kind of problem?
Any tips or pointers would be greatly appreciated. Thanks!
Edit:
The computer I'm using has 64GB of RAM, 96 cores (most of my work is very parallelizable), and terabytes of HDD (and some SSD) storage.
It's hard to estimate the total number of key pairs that will be in the reduced result, but it will certainly be at least in the hundreds of billions.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
正如 Frank Yellin 所观察到的,存在一种单轮 MapReduce 算法。映射器生成带有键
key1,key2
和值val
的键值对。 MapReduce 框架按键(shuffle)对这些对进行分组。减速器对值进行求和。为了控制内存使用,MapReduce将中间数据写入磁盘。传统上有
n
个文件,并且所有带有密钥key1,key2
的对都转到文件hash((key1,key2)) mod n
。这里有一个张力:n
应该足够大,以便每个文件都可以由内存中的映射处理,但如果n
太大,那么文件系统就会崩溃超过。粗略计算表明,对于您来说,n
可能介于 1e4 和 1e5 之间。希望操作系统将使用 RAM 来缓冲文件写入,但请确保最大化磁盘吞吐量,否则您可能必须自己实现缓冲。 (也可能有一个合适的框架,但您不必为一台机器编写太多代码。)我同意 user3386109 的观点,您将需要一个真正的大磁盘。如果您可以多次重新生成输入,则可以通过进行 k 次传递(每次仅保存文件的 1/k 部分)来用时间换取空间。
我担心这个 MapReduce 的运行时间相对于平均故障间隔时间来说太大。 MapReduce 传统上是为了容错和并行而分布式的。
如果您可以告诉我们有关输入如何产生以及您计划如何处理输出的信息,我们也许可以为您提供更好的建议。
As Frank Yellin observes, there's a one-round MapReduce algorithm. The mapper produces key-value pairs with key
key1,key2
and valueval
. The MapReduce framework groups these pairs by key (the shuffle). The reducer sums the values.In order to control the memory usage, MapReduce writes the intermediate data to disk. Traditionally there are
n
files, and all of the pairs with keykey1,key2
go to filehash((key1,key2)) mod n
. There is a tension here:n
should be large enough that each file can be handled by an in-memory map, but ifn
is too large, then the file system falls over. Back of the envelope math suggests thatn
might be between 1e4 and 1e5 for you. Hopefully the OS will use RAM to buffer the file writes for you, but make sure that you're maxing out your disk throughput or else you may have to implement buffering yourself. (There also might be a suitable framework, but you don't have to write much code for a single machine.)I agree with user3386109 that you're going to need a Really Big Disk. If you can regenerate the input multiple times, you can trade time for space by making k passes that each save only a 1/k fraction of the files.
I'm concerned that the running time of this MapReduce will be too large relative to the mean time between failures. MapReduce is traditionally distributed for fault tolerance as much as parallelism.
If there's anything you can tell us about how the input arises, and what you're planning to do with the output, we might be able to give you better advice.