What is the best way to determine the number of threads to fire off on a machine with n cores? (C++)

Posted 2024-12-27 17:33:19

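I have a vector with 10,000,000 (10 million) elements, and my workstation has four cores. There is a function called ThrFunc that operates on an integer. Assume that the runtime of ThrFunc is roughly the same for each integer in the vector.

How should I determine the optimal number of threads to fire off? Is the answer as simple as the number of elements divided by the number of cores? Or is there a more subtle computation?

Edited to provide additional information

No blocking is needed; each function invocation needs only read-only access.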

古镇旧梦 2025-01-03 17:33:20

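The optimal number of cores (threads) will probably be determined by the point at which the memory system (caches and RAM) becomes saturated. Another factor that can come into play is inter-core locking (locking a memory region that other cores might want to access, updating it, then unlocking it) and how efficient it is (how long each lock is held and how often it is taken and released).

A single core running generic software, whose code and data are not optimized for multiple cores, will come close to saturating memory all by itself. In that situation, adding more cores will make the application slower.

So unless your code economizes heavily on memory accesses, I'd guess the answer to your question is one (1).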

皇甫轩 2025-01-03 17:33:20

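I found a real-world analogy that I'll share here for those who want a less technical, more intuitive answer:

Having multiple threads per core is like having two queues for each scanner at an airport (everyone in both queues eventually gets through).

Two people at a time can place their luggage on the conveyor belt, but only one person at a time can pass through the scanner. Clearly there is a point of contention at the scanner entrance, but in practice both queues move well most of the time.

In this analogy, the queues represent threads and the scanner is the core's main function. As a rule of thumb, a core running two threads does about 1.25 times the work of a core running one; that is, the extra thread is not like having a whole new core. So if the task is CPU-bound, a thread count slightly higher than the number of available processors is probably best.

Note, however, that if the task is I/O-bound, so that threads spend most of their time waiting on external resources such as database connections, the file system, or other external data sources, then you can allocate more threads than the number of available processors.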
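One way to put a number on the I/O-bound case is a commonly cited sizing heuristic: threads ≈ cores × (1 + wait time / compute time). This formula is not from the answer above; it is an assumption added here for illustration, and suggest_threads is a hypothetical helper name. Treat the result as a starting point for measurement, not a guarantee.

```cpp
#include <cassert>
#include <cmath>

// Heuristic (an assumption, not from the original answer):
// threads ~= cores * (1 + wait_time / compute_time).
// wait_time and compute_time are average per-task times in the same unit.
unsigned suggest_threads(unsigned cores, double wait_time, double compute_time) {
    assert(cores > 0 && wait_time >= 0.0 && compute_time > 0.0);
    const double estimate = cores * (1.0 + wait_time / compute_time);
    return static_cast<unsigned>(std::lround(estimate));
}
```

With no waiting at all (a purely CPU-bound ThrFunc) the estimate collapses to the core count, which matches the advice above for CPU-bound work.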

Source 1, Source 2

猛虎独行 2025-01-03 17:33:19

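The optimal number of threads is likely to be either the number of cores in your machine or the number of cores times two.

In more abstract terms, you want the highest possible throughput. Getting the highest throughput requires the fewest contention points between threads (since the original problem is trivially parallelizable). The number of contention points is likely to be the number of threads sharing a core, or twice that, since a core can run either one or two logical threads (two with hyperthreading).

If your workload uses a resource of which you have fewer than four available (ALUs on Bulldozer? Hard disk access?), then the number of threads you should create will be limited by that.

As with all hardware questions, the best way to find the right answer is to test and find out.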
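A minimal sketch of such a test is shown here. It times the same workload with 1 to max_threads threads and reports the fastest count. The ThrFunc body is a hypothetical stand-in (the question does not show it), and time_run and best_thread_count are illustrative names. A single timing per count is noisy, so a real measurement would repeat each run several times.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical read-only stand-in for the question's ThrFunc.
inline long long ThrFunc(int x) { return static_cast<long long>(x) * x; }

// Runs ThrFunc over the whole vector with num_threads threads (one contiguous
// range per thread) and returns the elapsed wall-clock time in seconds.
double time_run(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    const auto start = std::chrono::steady_clock::now();
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = data.size() * t / num_threads;
        const std::size_t end = data.size() * (t + 1) / num_threads;
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long local = 0;
            for (std::size_t i = begin; i < end; ++i) local += ThrFunc(data[i]);
            partial[t] = local;  // keep the result so the work is not discarded
        });
    }
    for (auto& w : workers) w.join();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Tries every thread count from 1 to max_threads and returns the fastest one.
unsigned best_thread_count(const std::vector<int>& data, unsigned max_threads) {
    assert(max_threads > 0);
    unsigned best = 1;
    double best_time = time_run(data, 1);
    for (unsigned n = 2; n <= max_threads; ++n) {
        const double t = time_run(data, n);
        if (t < best_time) { best_time = t; best = n; }
    }
    return best;
}
```

On the four-core workstation from the question, sweeping up to 8 threads would cover both the "number of cores" and "number of cores times two" candidates.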

只为守护你 2025-01-03 17:33:19

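Borealid's answer includes "test and find out", which is impossible to beat as advice goes.

But there is perhaps more to testing this than you might think: you want your threads to avoid contention for data wherever possible. If the data is entirely read-only, you might see the best performance when your threads access "similar" data, walking through it in small blocks at a time so that each thread reads from the same pages over and over again. With read-only data there is no problem if each core gets its own copy of the cache lines. (Though this might not make the most use of each core's cache.)

If the data is modified in any way, you will see significant performance gains if you keep the threads away from each other, by a lot. Most caches store data in cache lines, and for good performance you desperately want to keep each cache line from bouncing between CPUs. In that case, you might want the different threads to work on data that is actually far apart, so that they never run into each other.

So: if you are updating the data while working on it, I'd recommend N or 2*N threads of execution (for N cores). With T threads in total, start thread m at offset m*SIZE/T, for m from 0 through T-1 (0, 1000, 2000, 3000 for four threads and 4000 data objects). This gives you the best chance of feeding different cache lines to each core and letting updates proceed without cache-line bouncing: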

+--------------+---------------+--------------+---------------+--- ...
| first thread | second thread | third thread | fourth thread | first ...
+--------------+---------------+--------------+---------------+--- ...
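The blocked layout in the diagram could be sketched as follows. This is an illustration under assumptions: the ThrFunc body is a hypothetical in-place update, and process_blocked is an illustrative name. Each thread gets one contiguous range, so threads only come near each other at range boundaries.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for ThrFunc that updates its element in place.
inline void ThrFunc(int& x) { x = x * 2 + 1; }

// Blocked partitioning: thread m handles the contiguous range
// [m * size / T, (m + 1) * size / T), so writes from different
// threads land in memory regions that are far apart.
void process_blocked(std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    const std::size_t size = data.size();
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned m = 0; m < num_threads; ++m) {
        const std::size_t begin = size * m / num_threads;
        const std::size_t end = size * (m + 1) / num_threads;
        workers.emplace_back([&data, begin, end] {
            for (std::size_t i = begin; i < end; ++i) ThrFunc(data[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```

With 4000 elements and four threads the ranges start at 0, 1000, 2000, and 3000, matching the example in the text. Two neighbouring threads can still share the single cache line that straddles a range boundary, which is negligible for ranges this large.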

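If you are not updating the data while working on it, you might instead start N or 2*N threads of execution (for N cores) at elements 0, 1, 2, 3, and so on, and move each thread forward by N or 2*N elements (the total number of threads) on each iteration. This lets the cache system fetch each page from memory once, populate the CPU caches with nearly identical data, and hopefully keep each core supplied with fresh data.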

+-----------------------------------------------------+
| 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 1 2 3 4 ... |
+-----------------------------------------------------+
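The interleaved layout in the diagram could be sketched like this. It assumes a hypothetical read-only ThrFunc that returns a value to be summed; sum_strided is an illustrative name. Each thread accumulates into a local variable and writes to shared memory only once, so the only shared writes are the final per-thread results.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical read-only stand-in for ThrFunc.
inline long long ThrFunc(int x) { return static_cast<long long>(x) * x; }

// Strided (interleaved) partitioning: thread t reads elements
// t, t + T, t + 2T, ... so all threads walk the same pages together.
long long sum_strided(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&data, &partial, t, num_threads] {
            long long local = 0;  // accumulate locally, publish once at the end
            for (std::size_t i = t; i < data.size(); i += num_threads)
                local += ThrFunc(data[i]);
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    long long total = 0;
    for (long long p : partial) total += p;
    return total;
}
```

Whether this beats the blocked layout for read-only data depends on the hardware, so it is worth timing both.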

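I also recommend using sched_setaffinity(2) directly in your code to force the different threads onto their own processors. In my experience, Linux tries so hard to keep each thread on its original processor that it will not migrate tasks to other cores that are otherwise idle.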
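A minimal, Linux-only sketch of pinning with sched_setaffinity(2) follows. pin_current_thread is a hypothetical helper name. It assumes glibc, where passing 0 as the pid applies the call to the calling thread, and it picks a CPU from the set the thread is currently allowed to use rather than assuming that CPU 0 is available.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // needed for the CPU_* macros; g++ usually defines it already
#endif
#include <sched.h>   // sched_getaffinity, sched_setaffinity, cpu_set_t
#include <cassert>
#include <thread>

// Pins the calling thread to the k-th CPU of its currently allowed set
// (wrapping around if k is too large). Returns false if a call fails.
bool pin_current_thread(unsigned k) {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return false;
    const int count = CPU_COUNT(&allowed);
    if (count <= 0) return false;
    unsigned target = k % static_cast<unsigned>(count);
    assert(target < static_cast<unsigned>(count));
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (!CPU_ISSET(cpu, &allowed)) continue;
        if (target-- == 0) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            return sched_setaffinity(0, sizeof(one), &one) == 0;
        }
    }
    return false;
}
```

Each worker would call pin_current_thread(m) with its own index m as its first action. Because a new thread inherits the affinity mask of its creator, the creating thread should not itself be pinned before it starts the workers.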

岁月无声 2025-01-03 17:33:19

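Assuming ThrFunc is CPU-bound, you probably want one thread per core, with the elements divided between them.

If there is an I/O element to the function, the answer is more complicated, because you can have one or more threads per core waiting on I/O while another is executing. Do some tests and see what actually happens.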

回忆躺在深渊里 2025-01-03 17:33:19

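I agree with the previous comments. You should run tests to determine which number yields the best performance. However, this only gives the best performance for the particular system you are optimizing for. In most scenarios your program will run on other people's machines, and you should not make too many assumptions about their architecture.

A good way to determine the number of threads to start programmatically is to use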

std::thread::hardware_concurrency()

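This is part of C++11 and should yield the number of logical cores in the current system. The number of logical cores is either the number of physical cores (if the processor does not support hardware threads, i.e. HyperThreading) or the number of hardware threads. The value is only a hint, and the function may return 0 when it cannot be determined.

There is also a Boost function that does the same; see "Programmatically find the number of cores on a machine".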
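A small sketch of how the call might be wrapped in practice is shown here. choose_thread_count and its fallback parameter are hypothetical names introduced for illustration. The wrapper handles the case where std::thread::hardware_concurrency() returns 0 and never asks for more threads than there are items.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>

// Picks a thread count from the hardware hint, falling back to `fallback`
// when the hint is unavailable, and capped by the number of work items.
unsigned choose_thread_count(std::size_t num_items, unsigned fallback = 1) {
    assert(fallback > 0);
    unsigned hw = std::thread::hardware_concurrency();  // 0 means "unknown"
    if (hw == 0) hw = fallback;
    if (num_items == 0) return 1;
    return static_cast<unsigned>(std::min<std::size_t>(hw, num_items));
}
```

For the 10,000,000-element vector in the question this simply returns the logical core count, which is a reasonable starting point before measuring.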
