What is the best way to determine the number of threads to fire off in a machine with n cores? (C++)
Suppose I have a vector<int> with 10,000,000 (10 million) elements, and that my workstation has four cores. There is a function, called ThrFunc, that operates on an integer. Assume that the runtime for ThrFunc for each integer in the vector<int> is roughly the same.
How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?
Editing to provide extra information
- No need for blocking; each function invocation needs only read-only access
Comments (7)
The optimal number of cores (threads) will probably be determined by when you achieve saturation of the memory system (caches and RAM). Another factor that could come into play is that of inter-core locking (locking a memory area that other cores might want to access, updating it and then unlocking it) and how efficient it is (how long the lock is in place and how often it is locked/unlocked).
A single core running generic software whose code and data are not optimized for multi-core will come close to saturating memory all by itself. Adding more cores will, in such a scenario, result in a slower application.
So unless your code economizes heavily on memory accesses, I'd guess the answer to your question is one (1).
I've found a real-world example that I'll put here for those who want a less technical, more intuitive answer:
Having multiple threads per core is like having two queues in an airport for each scanner (which people in both queues eventually have to pass through).
Two people at a time can put their baggage on the conveyor belt, but only one at a time can pass through the scanner. Obviously there is a contention point at the entrance of the scanner, but in reality both queues function very well most of the time.
In this example, the queues represent threads and the scanner represents the main execution resources of a core. As a general rule of thumb, a core running two threads delivers roughly 1.25 times the throughput of a core running one, i.e., a second thread is not like having an entire new core. So if the task is CPU-bound, a thread count slightly over the number of available processors is probably best.
But notice that if the task is I/O-bound, where threads spend most of their time waiting for external resources such as database connections, file systems, or other external sources of data, then you can assign (many) more threads than the number of available processors.
Source1, Source2
The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.
In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between the threads (since the original problem is trivially parallelizable). The number of contention points is likely to be the number of threads sharing a core or twice that, since a core can either run one or two logical threads (two with hyperthreading).
If your workload makes use of a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?) then the number of threads you should create will be limited by that.
The best way to find out the correct answer is, as with all hardware questions, to test and find out.
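As a rough illustration of "test and find out", the sketch below times the same workload with several thread counts and reports the fastest. The body of ThrFunc here is a hypothetical stand-in (the question does not say what it computes), and time_run and fastest_thread_count are names invented for this example. A single timing per candidate is noisy, so a real measurement would repeat each run several times.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for the question's ThrFunc: a small CPU-bound computation.
inline unsigned ThrFunc(int x) {
    unsigned v = static_cast<unsigned>(x);
    for (int i = 0; i < 16; ++i) v = v * 1664525u + 1013904223u;
    return v;
}

// Applies ThrFunc to every element using num_threads threads; returns elapsed seconds.
double time_run(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<unsigned long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &partial, chunk, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            unsigned long long sum = 0;
            for (std::size_t i = begin; i < end; ++i) sum += ThrFunc(data[i]);
            partial[t] = sum;  // store the result so the loop has an observable effect
        });
    }
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Times each candidate thread count once and returns the fastest on this machine.
unsigned fastest_thread_count(const std::vector<int>& data,
                              const std::vector<unsigned>& candidates) {
    assert(!candidates.empty());
    unsigned best = candidates.front();
    double best_time = time_run(data, best);
    for (std::size_t i = 1; i < candidates.size(); ++i) {
        const double t = time_run(data, candidates[i]);
        if (t < best_time) { best_time = t; best = candidates[i]; }
    }
    return best;
}
```

For the four-core workstation in the question, calling fastest_thread_count(data, {1, 2, 4, 8}) on the real vector would compare one thread, one thread per core, and oversubscription in a single pass.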
Borealid's answer includes test and find out, which is impossible to beat as advice goes.
But there's perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, then you might see best performance if your threads are accessing "similar" data -- making sure to walk through the data in small blocks at a time, so each thread is accessing data from the same pages over and over again. If the data is completely read-only, then there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)
If the data is in any way modified, then you will see significant performance enhancements if you keep the threads away from each other, by a lot. Most caches store data along cache lines, and you desperately want to keep each cache line from bouncing among CPUs for good performance. In that case, you might want to keep the different threads running on data that is actually far apart to avoid ever running into each other.
So: if you're updating the data while working on it, I'd recommend having N or 2*N threads of execution (for N cores), with thread m starting at index (SIZE/T)*m for m = 0 through T-1, where T is the number of threads. (That is 0, 1000, 2000, 3000 for four threads and 4000 data objects.) This will give you the best chance of feeding different cache lines to each core and allowing updates to proceed without cache line bouncing.
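A minimal sketch of this contiguous-block layout follows. The names block_starts and update_in_blocks are made up for the example, and the increment stands in for whatever update the real code performs. Any remainder is spread over the first blocks so that sizes differ by at most one element.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Returns the first index of each thread's contiguous block.
// When size is not divisible by num_threads, the first (size % num_threads)
// blocks receive one extra element.
std::vector<std::size_t> block_starts(std::size_t size, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<std::size_t> starts(num_threads);
    const std::size_t base = size / num_threads;
    const std::size_t extra = size % num_threads;
    std::size_t next = 0;
    for (std::size_t m = 0; m < num_threads; ++m) {
        starts[m] = next;
        next += base + (m < extra ? 1 : 0);
    }
    return starts;
}

// Updates data in place, one contiguous block per thread, so threads write to
// regions that are far apart (only block boundaries can share a cache line).
void update_in_blocks(std::vector<int>& data, std::size_t num_threads) {
    const std::vector<std::size_t> starts = block_starts(data.size(), num_threads);
    std::vector<std::thread> workers;
    for (std::size_t m = 0; m < num_threads; ++m) {
        const std::size_t begin = starts[m];
        const std::size_t end = (m + 1 < num_threads) ? starts[m + 1] : data.size();
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i) data[i] += 1;  // placeholder update
        });
    }
    for (auto& w : workers) w.join();
}
```

With 4000 elements and four threads, block_starts yields 0, 1000, 2000, 3000, matching the example in the answer.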
If you're not updating the data while working on it, you might wish to start N or 2*N threads of execution (for N cores), starting them at 0, 1, 2, 3, etc., and moving each one forward by N or 2*N elements (the thread count) with each iteration. This will allow the cache system to fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core populated with fresh data.
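The interleaved (strided) layout for read-only data could look like the sketch below. ThrFunc is again a hypothetical placeholder, and strided_sum is a name chosen for the example. Each thread accumulates into a local variable and writes its result once, so the only shared writes are one per thread. Whether striding actually beats contiguous blocks on a given machine is something to measure rather than assume.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical placeholder for the question's ThrFunc.
inline int ThrFunc(int x) { return x % 7; }

// Thread t handles elements t, t + T, t + 2T, ... where T is num_threads,
// so at any moment all threads read from the same neighbourhood of the vector.
long long strided_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &partial, num_threads, t] {
            long long sum = 0;  // local accumulator, written out once at the end
            for (std::size_t i = t; i < data.size(); i += num_threads)
                sum += ThrFunc(data[i]);
            partial[t] = sum;
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```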
I also recommend using sched_setaffinity(2) directly in your code to force the different threads onto their own processors. In my experience, Linux aims to keep each thread on its original processor so strongly that it will not migrate tasks to other cores that are otherwise idle.
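A Linux-only sketch of pinning with sched_setaffinity(2) is shown below. The helper names (allowed_cpus, pin_current_thread, run_pinned_workers) are invented for the example, and the worker body is left as a comment because the real workload is not specified. The code first asks which CPUs the process may use, since containers and job schedulers often restrict that set.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros with glibc
#endif
#include <sched.h>   // sched_getaffinity, sched_setaffinity (Linux)
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Lists the CPU indices the calling thread is currently allowed to run on.
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> cpus;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return cpus;
    for (int c = 0; c < CPU_SETSIZE; ++c)
        if (CPU_ISSET(c, &set)) cpus.push_back(c);
    return cpus;
}

// Pins the calling thread to one CPU (pid 0 means "the calling thread").
bool pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Starts one worker per allowed CPU, pins each to its own CPU, and returns
// how many workers were pinned successfully.
std::size_t run_pinned_workers() {
    const std::vector<int> cpus = allowed_cpus();
    std::atomic<std::size_t> pinned{0};
    std::vector<std::thread> workers;
    for (int cpu : cpus) {
        workers.emplace_back([cpu, &pinned] {
            if (pin_current_thread(cpu)) ++pinned;
            // ... this thread's share of the work would run here ...
        });
    }
    for (auto& w : workers) w.join();
    return pinned.load();
}
```

Pinning is worth benchmarking against the default scheduler behaviour, since it removes the kernel's freedom to rebalance when other processes compete for the same cores.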
Assuming ThrFunc is CPU-bound, you probably want one thread per core, with the elements divided between them. If there's an I/O element to the function then the answer is more complicated, because you can have one or more threads per core waiting for I/O while another is executing. Do some tests and see what happens.
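For the CPU-bound case, dividing the elements between a fixed number of tasks and combining their results might look like this sketch. It assumes, hypothetically, that ThrFunc returns a value worth summing; process_all is a name made up for the example, and std::async with std::launch::async is just one of several ways to start the threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical placeholder for the question's ThrFunc.
inline long long ThrFunc(int x) { return static_cast<long long>(x) * x; }

// Splits data into at most num_threads contiguous chunks, processes each chunk
// on its own thread, and adds up the per-chunk results.
long long process_all(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    std::vector<std::future<long long>> parts;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(data.size(), begin + chunk);
        parts.push_back(std::async(std::launch::async, [&data, begin, end] {
            long long sum = 0;
            for (std::size_t i = begin; i < end; ++i) sum += ThrFunc(data[i]);
            return sum;
        }));
    }
    long long total = 0;
    for (auto& p : parts) total += p.get();  // get() waits for that chunk to finish
    return total;
}
```

On the workstation from the question this would be called as process_all(data, 4), giving each of the four cores 2,500,000 elements.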
I agree with the previous comments. You should run tests to determine what number yields the best performance. However, this will only yield the best performance for the particular system you're optimizing for. In most scenarios, your program will be run on other people's machines, and you should not make too many assumptions about their architecture.
A good way to numerically determine the number of threads to start would be to use std::thread::hardware_concurrency(). This is part of C++11 and should yield the number of logical cores in the current system. Logical cores means either the physical number of cores, in case the processor does not support hardware threads (i.e., HyperThreading), or the number of hardware threads.
There's also a Boost function that does the same; see "Programmatically find the number of cores on a machine".
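One detail worth handling when querying the core count at run time: std::thread::hardware_concurrency() is only a hint and is allowed to return 0 when the value cannot be determined. A small wrapper with a fallback, sketched here under the hypothetical name default_thread_count, avoids starting zero threads.

```cpp
#include <thread>

// Returns the number of hardware threads reported by the standard library,
// or fallback when the implementation cannot determine it (reported as 0).
unsigned default_thread_count(unsigned fallback = 1) {
    const unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : fallback;
}
```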
The optimal number of threads should equal the number of cores. In that case, if the computation on each element is independent, the computing capacity of each core will be fully utilized.