How to do a scattered sum efficiently with SSE/x86
I've been tasked with writing a program that does streaming sums of vectors into scattered memory locations, at the absolute maximum speed possible. The input data is a destination ID and an XYZ float vector, so something like:
[198, {0.4,0,1}], [775, {0.25,0.8,0}], [12, {0.5,0.5,0.02}]
and I need to sum them into memory like so:
memory[198] += {0.4,0,1}
memory[775] += {0.25,0.8,0}
memory[12] += {0.5,0.5,0.02}
To complicate matters, there will be multiple threads doing this at the same time, reading from different input streams but summing to the same memory. I don't anticipate there being a lot of contention for the same memory locations, but there will be some. The data sets will be pretty large - multiple streams of 10+ GB apiece that we'll be streaming simultaneously from multiple SSDs to get the highest possible read bandwidth. I'm assuming SSE for the math, although it certainly doesn't have to be that way.
The results won't be used for a while, so I don't need to pollute the cache... but I'm summing into memory, not just writing, so I can't use something like MOVNTPS, right? But since the threads won't be stepping on each other that much, how can I do this without a lot of locking overhead? Would you do this with memory fencing?
Thanks for any help. I can assume Nehalem and above, if that makes a difference.
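For concreteness, here is a minimal single-threaded sketch of the operation being asked about; the Record layout and the padding of each destination slot to 16 bytes (x, y, z, pad) are assumptions for illustration, not part of the question:

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Assumed input record: destination ID plus an XYZ vector, padded to
// four floats so it can be loaded with a single SSE load.
struct Record {
    uint32_t id;
    float    v[4];   // x, y, z, padding
};

// Destination memory: one 16-byte slot (x, y, z, pad) per ID.
// 'dest' must be 16-byte aligned for the aligned load/store below.
void accumulate(float* dest, const Record* records, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        float* slot = dest + records[i].id * 4;
        __m128 sum  = _mm_load_ps(slot);            // memory[id]
        __m128 add  = _mm_loadu_ps(records[i].v);   // incoming vector
        _mm_store_ps(slot, _mm_add_ps(sum, add));   // memory[id] += vec
    }
}
```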
Comments (2)
You can use spin locks to synchronize access to the array elements (one per ID) and SSE for the summing. In C++, depending on the compiler, intrinsic functions may be available, e.g. the Streaming SIMD Extensions intrinsics and InterlockedExchange in Visual C++.
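A rough sketch of that suggestion, with one spin lock per destination slot and an SSE add; std::atomic_flag is used here as a portable stand-in for an InterlockedExchange-based lock, and the slot/record layout is assumed:

```cpp
#include <atomic>
#include <cstdint>
#include <immintrin.h>  // SSE intrinsics and _mm_pause

// One spin lock per destination slot. std::atomic_flag stands in for a
// Visual C++ InterlockedExchange-based lock.
std::atomic_flag* locks;   // one per ID, all initially clear
float*            dest;    // 16-byte aligned, 4 floats (x, y, z, pad) per ID

void add_vector(uint32_t id, const float* xyz /* x, y, z, pad */) {
    // Acquire the per-element lock.
    while (locks[id].test_and_set(std::memory_order_acquire))
        _mm_pause();                                // be polite while spinning

    float* slot = dest + id * 4;
    _mm_store_ps(slot, _mm_add_ps(_mm_load_ps(slot), _mm_loadu_ps(xyz)));

    // Release the lock.
    locks[id].clear(std::memory_order_release);
}
```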
Your program's performance will be limited by memory bandwidth. Don't expect significant speed improvement from multithreading unless you have a multi-CPU (not just multi-core) system.
Start one thread per CPU. Statically distribute the destination data between these threads, and feed every thread the same input data. This makes better use of a NUMA architecture and avoids extra memory traffic for thread synchronization.
In the case of a single-CPU system, use only one thread to access the destination data.
Probably the only practical use for the extra cores in each CPU is to load the input data with additional threads.
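A sketch of that partitioning scheme, under the assumption that every worker scans the same input and applies only the records whose ID falls in its own slice of the destination array:

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

struct Record { uint32_t id; float v[4]; };  // assumed record layout

// Each worker thread owns the contiguous ID range [id_begin, id_end) and
// skips every other record. Because no two threads ever write to the same
// slot, no locks or fences are needed while the workers run.
void worker(float* dest, const Record* records, std::size_t count,
            uint32_t id_begin, uint32_t id_end) {
    for (std::size_t i = 0; i < count; ++i) {
        const uint32_t id = records[i].id;
        if (id < id_begin || id >= id_end)
            continue;                              // not this thread's slice
        float* slot = dest + id * 4;
        _mm_store_ps(slot, _mm_add_ps(_mm_load_ps(slot),
                                      _mm_loadu_ps(records[i].v)));
    }
}
```

The price is that every thread reads the entire input stream, which the answer accepts on the grounds that the workload is memory-bandwidth-bound anyway.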
One obvious optimization is to align the destination data to 16 bytes (to avoid touching two cache lines when accessing a single data element).
You can use SIMD to perform the addition, or let the compiler auto-vectorize your code, or just leave this operation completely unoptimized - it doesn't matter; it's nothing compared to the memory-bandwidth problem.
As for polluting the cache with output data, MOVNTPS cannot help here, but you can use PREFETCHNTA to prefetch output data elements several steps ahead while minimizing cache pollution. Whether it will improve or degrade performance, I don't know: it avoids cache thrashing, but leaves most of the cache unused.
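Putting the last three points together, a sketch with a 16-byte-aligned destination, an SSE add, and a PREFETCHNTA issued a few records ahead (the lookahead of 16 records is an untuned guess, and the Record layout is assumed):

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

struct Record { uint32_t id; float v[4]; };  // assumed record layout

// 'dest' is assumed to be 16-byte aligned (e.g. from _mm_malloc) so each
// 4-float slot lives in a single cache line. The lookahead distance would
// need tuning on real hardware.
void accumulate_prefetch(float* dest, const Record* records, std::size_t count) {
    const std::size_t lookahead = 16;
    for (std::size_t i = 0; i < count; ++i) {
        if (i + lookahead < count) {
            // PREFETCHNTA: pull a future destination slot toward the core
            // with minimal cache pollution.
            _mm_prefetch(reinterpret_cast<const char*>(
                             dest + records[i + lookahead].id * 4),
                         _MM_HINT_NTA);
        }
        float* slot = dest + records[i].id * 4;
        _mm_store_ps(slot, _mm_add_ps(_mm_load_ps(slot),
                                      _mm_loadu_ps(records[i].v)));
    }
}
```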