How to do a scattered sum efficiently with SSE/x86
I've been tasked with writing a program that does streaming sums of vectors into scattered memory locations, at the absolute maximum speed possible. The input data is a destination ID and an XYZ float vector, so something like:
[198, {0.4,0,1}], [775, {0.25,0.8,0}], [12, {0.5,0.5,0.02}]
and I need to sum them into memory like so:
memory[198] += {0.4,0,1}
memory[775] += {0.25,0.8,0}
memory[12] += {0.5,0.5,0.02}
To complicate matters, there will be multiple threads doing this at the same time, reading from different input streams but summing to the same memory. I don't anticipate there being a lot of contention for the same memory locations, but there will be some. The data sets will be pretty large - multiple streams of 10+ GB apiece that we'll be streaming simultaneously from multiple SSDs to get the highest possible read bandwidth. I'm assuming SSE for the math, although it certainly doesn't have to be that way.
The results won't be used for a while, so I don't need to pollute the cache... but I'm summing into memory, not just writing, so I can't use something like MOVNTPS, right? But since the threads won't be stepping on each other that much, how can I do this without a lot of locking overhead? Would you do this with memory fencing?
Thanks for any help. I can assume Nehalem and above, if that makes a difference.
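For concreteness, here is a minimal single-threaded sketch of the operation being asked about; the Record layout and the padding of each destination slot to 16 bytes (x, y, z, pad) are assumptions for illustration, not part of the question:

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Assumed input record: destination ID plus an XYZ vector, padded to
// four floats so it can be loaded with a single SSE load.
struct Record {
    uint32_t id;
    float    v[4];   // x, y, z, padding
};

// Destination memory: one 16-byte slot (x, y, z, pad) per ID.
// 'dest' must be 16-byte aligned for the aligned load/store below.
void accumulate(float* dest, const Record* records, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        float* slot = dest + records[i].id * 4;
        __m128 sum  = _mm_load_ps(slot);            // memory[id]
        __m128 add  = _mm_loadu_ps(records[i].v);   // incoming vector
        _mm_store_ps(slot, _mm_add_ps(sum, add));   // memory[id] += vec
    }
}
```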
Comments (2)
You can use spin locks to synchronize access to the array elements (one per ID) and SSE for the summing. In C++, depending on the compiler, intrinsic functions may be available, e.g. the Streaming SIMD Extensions intrinsics and InterlockedExchange in Visual C++.
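A rough sketch of that suggestion, with one spin lock per destination slot and an SSE add; std::atomic_flag is used here as a portable stand-in for an InterlockedExchange-based lock, and the slot/record layout is assumed:

```cpp
#include <atomic>
#include <cstdint>
#include <immintrin.h>  // SSE intrinsics and _mm_pause

// One spin lock per destination slot. std::atomic_flag stands in for a
// Visual C++ InterlockedExchange-based lock.
std::atomic_flag* locks;   // one per ID, all initially clear
float*            dest;    // 16-byte aligned, 4 floats (x, y, z, pad) per ID

void add_vector(uint32_t id, const float* xyz /* x, y, z, pad */) {
    // Acquire the per-element lock.
    while (locks[id].test_and_set(std::memory_order_acquire))
        _mm_pause();                                // be polite while spinning

    float* slot = dest + id * 4;
    _mm_store_ps(slot, _mm_add_ps(_mm_load_ps(slot), _mm_loadu_ps(xyz)));

    // Release the lock.
    locks[id].clear(std::memory_order_release);
}
```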
Your program's performance will be limited by memory bandwidth. Don't expect significant speed improvement from multithreading unless you have a multi-CPU (not just multi-core) system.
Start one thread per CPU. Statically distribute the destination data between these threads, and feed every thread the same input data. This makes better use of a NUMA architecture and avoids extra memory traffic for thread synchronization.
In the case of a single-CPU system, use only one thread to access the destination data.
Probably the only practical use for the extra cores in each CPU is to load the input data with additional threads.
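A sketch of that partitioning scheme, under the assumption that every worker scans the same input and applies only the records whose ID falls in its own slice of the destination array:

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

struct Record { uint32_t id; float v[4]; };  // assumed record layout

// Each worker thread owns the contiguous ID range [id_begin, id_end) and
// skips every other record. Because no two threads ever write to the same
// slot, no locks or fences are needed while the workers run.
void worker(float* dest, const Record* records, std::size_t count,
            uint32_t id_begin, uint32_t id_end) {
    for (std::size_t i = 0; i < count; ++i) {
        const uint32_t id = records[i].id;
        if (id < id_begin || id >= id_end)
            continue;                              // not this thread's slice
        float* slot = dest + id * 4;
        _mm_store_ps(slot, _mm_add_ps(_mm_load_ps(slot),
                                      _mm_loadu_ps(records[i].v)));
    }
}
```

The price is that every thread reads the entire input stream, which the answer accepts on the grounds that the workload is memory-bandwidth-bound anyway.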
One obvious optimization is to align the destination data to 16 bytes (to avoid touching two cache lines when accessing a single data element).
You can use SIMD to perform the addition, or let the compiler auto-vectorize your code, or just leave this operation completely unoptimized - it doesn't matter; it's nothing compared to the memory-bandwidth problem.
As for polluting the cache with output data, MOVNTPS cannot help here, but you can use PREFETCHNTA to prefetch output data elements several steps ahead while minimizing cache pollution. Whether it will improve or degrade performance, I don't know: it avoids cache thrashing, but leaves most of the cache unused.
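Putting the last three points together, a sketch with a 16-byte-aligned destination, an SSE add, and a PREFETCHNTA issued a few records ahead (the lookahead of 16 records is an untuned guess, and the Record layout is assumed):

```cpp
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

struct Record { uint32_t id; float v[4]; };  // assumed record layout

// 'dest' is assumed to be 16-byte aligned (e.g. from _mm_malloc) so each
// 4-float slot lives in a single cache line. The lookahead distance would
// need tuning on real hardware.
void accumulate_prefetch(float* dest, const Record* records, std::size_t count) {
    const std::size_t lookahead = 16;
    for (std::size_t i = 0; i < count; ++i) {
        if (i + lookahead < count) {
            // PREFETCHNTA: pull a future destination slot toward the core
            // with minimal cache pollution.
            _mm_prefetch(reinterpret_cast<const char*>(
                             dest + records[i + lookahead].id * 4),
                         _MM_HINT_NTA);
        }
        float* slot = dest + records[i].id * 4;
        _mm_store_ps(slot, _mm_add_ps(_mm_load_ps(slot),
                                      _mm_loadu_ps(records[i].v)));
    }
}
```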