Accumulating grouped sums of trillions of values



I have a data reduction issue that is proving to be very difficult to solve.

Essentially, I have a program that calculates incremental values (floating point) for pairs of keys from a set of about 60 million keys total. The program will generate values for about 53 trillion pairs 'relatively' quickly (simply iterating through the values would take about three days). Not every pair of keys will occur, and many pairs will come up many times. There is no reasonable way to have the pairs come up in a particular order. What I need is a way to find the sum of the values generated for each pair of keys.

For data that would fit in memory, this is a very simple problem. In Python it would look something like:

from collections import Counter
res = Counter()
for key1,key2,val in data_generator():
    res[(key1,key2)] += val

The problem, of course, is that a mapping like that won't fit in memory. So I'm looking for a way to do this efficiently with a mix of on-disk and in-memory processing.

So far I've tried:

  1. A PostgreSQL table with upserts (ON CONFLICT ... DO UPDATE). This turned out to be far, far too slow.
  2. A hybrid approach: in-memory dictionaries in Python that are written out to a RocksDB or LMDB key-value store when they grow too big (a rough sketch of this follows the list). Though these stores are much faster than PostgreSQL for this kind of task, the projected time to complete is still on the order of months.
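
For context, the option-2 hybrid looks roughly like the sketch below. This is a minimal illustration only (it assumes the lmdb Python bindings, integer keys, and an arbitrary flush threshold), not the actual implementation:

import struct
from collections import Counter

import lmdb

FLUSH_AT = 10_000_000  # illustrative: entries held in RAM before spilling to disk

env = lmdb.open("pair_sums.lmdb", map_size=2 ** 40)  # ~1 TB memory map

def flush(counter):
    # Read-modify-write each accumulated pair into the on-disk store.
    with env.begin(write=True) as txn:
        for (k1, k2), v in counter.items():
            key = struct.pack(">QQ", k1, k2)  # assumes integer keys
            old = txn.get(key)
            total = v + (struct.unpack(">d", old)[0] if old else 0.0)
            txn.put(key, struct.pack(">d", total))
    counter.clear()

buf = Counter()
for key1, key2, val in data_generator():
    buf[(key1, key2)] += val
    if len(buf) >= FLUSH_AT:
        flush(buf)
flush(buf)

Even with the in-memory batching, every flush still performs a random read-modify-write per key against the store, which seems to be what makes this approach so slow at this scale.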

At this point, I'm hoping someone has a better approach that I could try. Is there a way to break this problem up into smaller parts? Is there a standard MapReduce approach to this kind of problem?

Any tips or pointers would be greatly appreciated. Thanks!

Edit:
The computer I'm using has 64GB of RAM, 96 cores (most of my work is very parallelizable), and terabytes of HDD (and some SSD) storage.

It's hard to estimate the total number of key pairs that will be in the reduced result, but it will certainly be at least in the hundreds of billions.


Answer by 痴意少年, 2025-01-18 03:36:12


As Frank Yellin observes, there's a one-round MapReduce algorithm. The mapper produces key-value pairs with key key1,key2 and value val. The MapReduce framework groups these pairs by key (the shuffle). The reducer sums the values.

In order to control the memory usage, MapReduce writes the intermediate data to disk. Traditionally there are n files, and all of the pairs with key key1,key2 go to file hash((key1,key2)) mod n. There is a tension here: n should be large enough that each file can be handled by an in-memory map, but if n is too large, then the file system falls over. Back of the envelope math suggests that n might be between 1e4 and 1e5 for you. Hopefully the OS will use RAM to buffer the file writes for you, but make sure that you're maxing out your disk throughput or else you may have to implement buffering yourself. (There also might be a suitable framework, but you don't have to write much code for a single machine.)
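
A minimal single-machine sketch of this scheme is below. The partition count, the binary record format, and integer keys are illustrative assumptions, and data_generator() is the placeholder from the question:

import struct
from collections import Counter

N_PARTITIONS = 20_000           # illustrative; somewhere in the 1e4-1e5 range
RECORD = struct.Struct(">QQd")  # key1, key2, value packed into 24 bytes

# Phase 1 (map + shuffle): route every record to a file chosen by hashing the pair.
# With this many partitions you will probably need to raise the OS open-file limit.
parts = [open(f"part_{i:05d}.bin", "wb", buffering=1 << 20)
         for i in range(N_PARTITIONS)]
for key1, key2, val in data_generator():
    f = parts[hash((key1, key2)) % N_PARTITIONS]
    f.write(RECORD.pack(key1, key2, val))
for f in parts:
    f.close()

# Phase 2 (reduce): each partition now fits in memory, so sum it with a plain Counter.
with open("sums.bin", "wb") as out:
    for i in range(N_PARTITIONS):
        sums = Counter()
        with open(f"part_{i:05d}.bin", "rb") as f:
            while rec := f.read(RECORD.size):
                key1, key2, val = RECORD.unpack(rec)
                sums[(key1, key2)] += val
        for (key1, key2), total in sums.items():
            out.write(RECORD.pack(key1, key2, total))

Phase 1 is append-only sequential I/O, and phase 2 is embarrassingly parallel across partition files, so your 96 cores can be put to work in the reduce step. As a rough sizing check: a few hundred billion result pairs spread over 2e4 partitions is on the order of 1e7 entries per partition, a few gigabytes as a Python dict, which is comfortable in 64 GB.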

I agree with user3386109 that you're going to need a Really Big Disk. If you can regenerate the input multiple times, you can trade time for space by making k passes that each save only a 1/k fraction of the files.
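
Concretely, the k-pass variant only changes the routing step: on pass p you open and write only the partitions assigned to that pass and drop everything else. A sketch, with the same illustrative record format and partition count as above:

import struct

K_PASSES = 4                    # illustrative
N_PARTITIONS = 20_000
RECORD = struct.Struct(">QQd")

def run_pass(p):
    # Open only this pass's share of the partition files.
    kept = [i for i in range(N_PARTITIONS) if i % K_PASSES == p]
    files = {i: open(f"part_{i:05d}.bin", "wb", buffering=1 << 20) for i in kept}
    for key1, key2, val in data_generator():  # the input is regenerated on every pass
        part = hash((key1, key2)) % N_PARTITIONS
        if part % K_PASSES == p:              # pairs owned by other passes are dropped
            files[part].write(RECORD.pack(key1, key2, val))
    for f in files.values():
        f.close()

Reducing (and then deleting) each pass's partitions before starting the next pass keeps peak intermediate disk usage at roughly 1/k of the total.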

I'm concerned that the running time of this MapReduce will be too large relative to the mean time between failures. MapReduce is traditionally distributed for fault tolerance as much as parallelism.

If there's anything you can tell us about how the input arises, and what you're planning to do with the output, we might be able to give you better advice.
