Have you used multiple cores to improve speed? What did you do, and was it worth the effort?

Posted on 2024-07-14 18:36:18

So I did not look in the right place before posting this.

I was looking at the result of the computer language benchmark game:

<http://shootout.alioth.debian.org/u32q/index.php>

And it seems that most of the fastest solutions are still C/C++ using just 
a single core of the 4 core machine that runs the tests.

I was wondering whether multi-core is worth it at all for single tasks, or whether,
when you really need more speed, you should just tune your code or rewrite it in C/C++ instead.

When you click on a full benchmark link such as http://shootout.alioth.debian.org/u32q/benchmark.php?test=knucleotide&lang=all it is obvious that quite a few solutions use multiple cores.

It would still be interesting to hear of your personal experiences:

Have you had success using 4 or 8 cores in order to actually improve performance on a single task?

What tools/language did you use?

How big was the improvement?

Was it worth the effort?

Comments (11)

习惯成性 2024-07-21 18:36:18

> And it seems that the fastest solutions are still C/C++ using just a single core of the 4 core machine that runs the tests.

No, that's not true for all of the code. In fact, of the solutions I've looked at, all use multiple parallel threads, and thus multiple cores. Some (e.g. k-nucleotide) use tools such as OpenMP (or, also interesting, SSE parallelization) to help with parallelization.

EDIT In fact, the fastest C++ solution for every problem uses parallel threads, with three exceptions:

  1. fasta benchmark, hard (but altogether possible) to parallelize due to random generator usage.
  2. pidigits, uses the GMP library.
  3. n-body, could be parallelized.

… and most other solutions also use SSE2 support.

命硬 2024-07-21 18:36:18

In order to increase performance for a single task on a multicore system, you'd have to design your task to split up into different parts (à la MapReduce) and hand each part to a different core. Lots of programs do things like that, and it does increase performance.

A few compression programs currently support more than one processor, such as 7-Zip. It's not terribly difficult to do something like that, but if your task can't be broken into cooperating parts, you're not going to get any help from more than one core.
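As a rough illustration of the split, process, and combine pattern described above, here is a minimal Python sketch. The function names (`split_into_chunks`, `sum_of_squares`, `parallel_sum_of_squares`) and the sum-of-squares workload are assumptions chosen for the example and do not come from the answer. The `executor_cls` parameter is also a hypothetical convenience, so that a thread pool can be swapped in.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split a list into at most n_chunks contiguous, roughly equal slices."""
    n_chunks = max(1, min(n_chunks, len(items)))
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The 'map' step: work on one chunk, independent of all other chunks."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=None,
                            executor_cls=ProcessPoolExecutor):
    """Hand each chunk to a worker, then combine the partial results.

    Pass executor_cls=ThreadPoolExecutor to use threads instead of processes.
    """
    workers = workers or os.cpu_count() or 1
    chunks = split_into_chunks(items, workers)
    with executor_cls(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)  # the 'reduce' step
```

With the default process pool, call `parallel_sum_of_squares` from under an `if __name__ == "__main__":` guard on platforms that start workers by spawning a fresh interpreter. The chunks must be large enough that the work outweighs the cost of shipping data to the workers, otherwise the parallel version can be slower than the serial one.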

可爱咩 2024-07-21 18:36:18

It really depends on how the algorithm works and the size of the dataset you're processing as to whether it scales well across multiple cores. Staying on the same core gives you an awful lot of advantages, including taking advantage of processor pipelining and using registers and cache - all of which are super-quick.

As multiple cores become more important in the future, we'll probably see some interesting cross-core optimizations becoming available.

萤火眠眠 2024-07-21 18:36:18

How do you define a "single task"? Many single conceptual tasks can nevertheless be split up into many independent subtasks. That's where multiple cores may provide a performance boost.

Of course, this requires you to structure your program so that these subtasks can actually be processed independently.

恋竹姑娘 2024-07-21 18:36:18

I use multicore quite regularly when performing Monte Carlo simulations. In this case it can be an absolute godsend, because sometimes these simulations take forever and each run is independent of every other run. In fact, right now I'm waiting for a Monte Carlo simulation to run on my quad core.

Another use case is when testing a machine learning algorithm using cross-validation. The dataset can be loaded once and stored in an immutable object. Then, each cross-validation iteration can be performed independently. For things like this, the key is to be careful about memory allocations and avoid the implicit lock acquisition this involves. If you allocate and free/garbage collect infrequently enough, the speedup can be near linear in cores used.
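To make the "each run is independent" point concrete, here is a small Python sketch of a Monte Carlo estimate of pi spread over several workers. It is not the simulation from the answer. The function names, the pi workload, and the use of consecutive integers as seeds are all assumptions made for illustration. Each run gets its own random generator, so no state is shared and no lock is needed.

```python
import random
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def count_hits(job):
    """One independent run: count random points inside the unit quarter circle."""
    seed, n_samples = job
    rng = random.Random(seed)  # private generator, nothing shared between runs
    hits = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(n_runs=8, n_samples=100_000, workers=4,
                executor_cls=ProcessPoolExecutor):
    """Run n_runs independent simulations in parallel and pool their counts.

    Pass executor_cls=ThreadPoolExecutor to use threads instead of processes.
    """
    # Assumption: consecutive integers are good enough as seeds for a demo.
    jobs = [(seed, n_samples) for seed in range(n_runs)]
    with executor_cls(max_workers=workers) as pool:
        total_hits = sum(pool.map(count_hits, jobs))
    return 4.0 * total_hits / (n_runs * n_samples)
```

Because each job carries only a seed and a sample count, almost no data moves between workers, which is the situation in which near-linear speedup is plausible. With the default process pool, call `estimate_pi` from under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.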

弥枳 2024-07-21 18:36:18

I've seen 4x improvement easily gained on a processing pipeline.

惟欲睡 2024-07-21 18:36:18

There are certain types of tasks which can be trivially multithreaded, and thus, allow a performance increase in systems which have multiple cores.

Image processing is one arena which could benefit from multicore. For example, applying an image filter is a process that is independent of results from other parts of an image. Therefore, as mentioned previously in Alex Fort's answer, by splitting the problem, in this case, the image filtering, into multiple parts and running the processing in multiple threads, I was able to see a decrease in processing time.

In fact, multithreading increased the performance not only on multicore processors, but also on my Intel Atom N270-based system, which only has a single core but offers two logical cores through simultaneous multithreading (hyper-threading).

I performed a few tests of applying an image filter using multiple threads (by splitting the processing into four threads) and a single thread.

For multithreading, the ExecutorService from the java.util.concurrent package was used to coordinate the multithreaded processing. Implementing this functionality was fairly trivial.

Although these are not exact numbers, nor a perfect benchmark, on a dual-core Core 2 Duo the processing time for multithreaded code decreased by 30-50% compared to single-threaded code, and on a hyper-threaded Atom it decreased by 20-30%.

As with other problems which involve splitting the problem into parts, the scalability of this method of processing is going to depend on the time spent by the steps where the problem is split up and combined.
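The band-splitting approach described in this answer can be sketched as follows. The answer used Java's ExecutorService. This sketch uses Python's `concurrent.futures.ThreadPoolExecutor` as a stand-in. The invert filter, the list-of-rows image representation, and the names `invert_rows` and `filter_image` are assumptions for illustration, not the author's code.

```python
from concurrent.futures import ThreadPoolExecutor


def invert_rows(rows):
    """Per-pixel filter on one band of 8-bit grayscale rows.

    Each output pixel depends only on its own input pixel, so bands can be
    processed in any order without coordination.
    """
    return [[255 - pixel for pixel in row] for row in rows]


def filter_image(image, n_bands=4):
    """Split the image into horizontal bands, filter each band, and reassemble."""
    height = len(image)
    band_height = max(1, (height + n_bands - 1) // n_bands)  # ceiling division
    bands = [image[i:i + band_height] for i in range(0, height, band_height)]
    with ThreadPoolExecutor(max_workers=n_bands) as pool:
        # map() yields results in submission order, so bands stay in place.
        filtered = pool.map(invert_rows, bands)
        return [row for band in filtered for row in band]
```

This shows the structure rather than a guaranteed speedup. In CPython, the global interpreter lock keeps pure-Python pixel loops like this one from running on several cores at once, so a real gain would need a process pool or a filter implemented in native code that releases the lock. The split and reassembly steps are the overhead that the last paragraph of the answer refers to.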

眼趣 2024-07-21 18:36:18

I wrote a CD label editing program in REALbasic which was cross-platform (hence not being able to just rely on GDI+ or Cocoa). It allows layering of multiple masked images with clipping to label shapes.

I switched from the image blit and zooming routines built into the language to using a plugin which was able to use up to 4 cores and achieved a significant speedup of key user operations, especially when zoomed.

This was a nice domain separation for a drop-in solution - I passed in a single image to the binary plugin and it internally partitioned the work across the processors. As a library solution, it required no multi-core awareness on the part of my program.

逆流 2024-07-21 18:36:18

make -j 6

Took 6 minutes out of a 7 minute build. :)

-黛色若梦 2024-07-21 18:36:18

I would like to remind everyone of Amdahl's Law, which is a description of the diminishing returns gained by increased parallelism, and serves also to model how much speedup can be expected for a given algorithm.
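For reference, Amdahl's Law says that if a fraction p of a program's run time can be parallelized perfectly and the rest is serial, the speedup on n cores is 1 / ((1 - p) + p / n), which can never exceed 1 / (1 - p). A minimal Python sketch, with the function name `amdahl_speedup` chosen for this example:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Ideal speedup predicted by Amdahl's Law.

    parallel_fraction: share of the serial run time that parallelizes (0 to 1).
    n_cores: number of cores the parallel part is spread over (at least 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)
```

For example, a task that is 90% parallelizable is limited to a tenfold speedup no matter how many cores are added, because the serial 10% still has to run on one core. The model ignores communication and scheduling overhead, so measured speedups are usually lower.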

源来凯始玺欢你 2024-07-21 18:36:18

I achieved a high speedup using 16 cores (on an Amazon EC2 instance) in a project that solves SVMs. The speedup ranges from 10x to 16x, depending on the dataset the algorithm uses:

https://github.com/RobeDM/LIBIRWLS

This is the paper I wrote:

http://www.sciencedirect.com/science/article/pii/S0167865516302173
