Surprising results with a .NET multi-threading algorithm

Posted on 2024-08-17 08:31:34

I recently wrote a C# console timetabling algorithm that is based on a combination of a genetic algorithm with a few brute force routines thrown in. The initial results were promising, but I figured I could improve the performance by splitting the brute force routines up to run in parallel on multi-processor architectures. To do this I used the well-documented Producer/Consumer model (as described in this fantastic article: http://www.albahari.com/threading/part2.aspx#_ProducerConsumerQWaitHandle). I changed my code to create one thread per logical processor during the brute force routines.

The performance gains on my workstation were very pleasing. I am running Windows XP on the following hardware:

Intel Core 2 Quad CPU, 2.33 GHz, 3.49 GB RAM

Initial tests indicated average performance gains of approx 40% when using 4 threads. The next step was to deploy the new multi-threading version of the algorithm to our higher spec UAT server. Here is the spec of our UAT server:

Windows 2003 Server R2 Enterprise x64, 8 CPUs (Quad-Core) AMD Opteron 2.70 GHz, 255 GB RAM

After running the first round of tests we were all extremely surprised to find that the algorithm actually runs slower on the high-spec W2003 server than on my local XP workstation! In fact, the tests seem to indicate that it doesn't matter how many threads are created (tests were run with the app spawning between 2 and 32 threads). The algorithm always runs significantly slower on the UAT W2003 server.

How could this be? Surely the app should run faster on an 8 CPU (Quad-Core) server than on my Core 2 Quad workstation? Why are we seeing no performance gains from multi-threading on the W2003 server, whilst the XP workstation tests show gains of up to 40%?

Any help or pointers would be appreciated.

Regards

Myles


Comments (4)

那片花海 2024-08-24 08:31:34

You need to find out where it's spending its time. Could it be something silly like very slow console writes?

It sounds like you're changing between an x86 and x64 platform too, but you don't say how your .NET app is compiled - is it running as 32- or 64-bit on the x64 machine?
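Both questions above can be checked from inside the app itself. The following is only a minimal sketch: the `RuntimeInfo` class and its method names are made up for illustration, and the `Thread.Sleep` call is a placeholder for the real brute force phase, which the question does not show.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class RuntimeInfo
{
    // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit process.
    public static int ProcessBitness()
    {
        return IntPtr.Size * 8;
    }

    // Times a single phase of the program with a high-resolution stopwatch.
    public static TimeSpan Time(Action work)
    {
        Stopwatch sw = Stopwatch.StartNew();
        work();
        sw.Stop();
        return sw.Elapsed;
    }

    static void Main()
    {
        Console.WriteLine("Process is running as {0}-bit", ProcessBitness());
        Console.WriteLine("Logical processors seen by .NET: {0}", Environment.ProcessorCount);

        // Placeholder workload (an assumption): substitute the real brute force phase.
        TimeSpan elapsed = Time(() => Thread.Sleep(50));
        Console.WriteLine("Phase took {0:F1} ms", elapsed.TotalMilliseconds);
    }
}
```

Running the same build on both machines and comparing the per-phase timings should at least show whether the slowdown is in the brute force phase or somewhere else, such as console output.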

戴着白色围巾的女孩 2024-08-24 08:31:34

It very much depends on your code and the OS. It is impossible to answer your question without examining the code. It is easy to get multi-threading wrong.

无敌元气妹 2024-08-24 08:31:34

My guess (which is limited given the lack of information) is that you may be experiencing problems due to true sharing or, more likely, false sharing.

False sharing can easily cause algorithms to slow down as more cores are added, because threads writing to nearby memory keep invalidating each other's cache lines and cause excessive cache misses. If your server has a larger cache line size, this makes it more likely to occur.

I suspect this may be the problem, particularly because you're only getting a 40% boost on 4 threads vs. 1. Often, you'll get a certain amount of scalability up to a low threshold of threads, then start getting cache misses that cause performance to drop dramatically.
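To see whether false sharing matters on a given machine, a small experiment along these lines may help. It is a sketch, not the asker's code: `FalseSharingDemo`, `Run`, and `PaddedStride` are hypothetical names, and the padding assumes cache lines of at most 128 bytes. Each thread increments only its own counter, so the threads share no data. The only difference between the two runs is whether the counters sit next to each other in memory.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class FalseSharingDemo
{
    // Assumption: cache lines are at most 128 bytes, so 16 longs of spacing
    // puts each thread's counter on its own line.
    public const int PaddedStride = 16;

    // Every thread increments its own slot 'iterations' times.
    // 'stride' is the distance, in array elements, between two threads' slots.
    public static long[] Run(int threadCount, int stride, int iterations)
    {
        long[] counters = new long[threadCount * stride];
        Thread[] threads = new Thread[threadCount];

        for (int t = 0; t < threadCount; t++)
        {
            int slot = t * stride; // fresh variable per iteration for the closure
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < iterations; i++)
                    counters[slot]++;
            });
        }

        foreach (Thread th in threads) th.Start();
        foreach (Thread th in threads) th.Join();

        long[] totals = new long[threadCount];
        for (int t = 0; t < threadCount; t++)
            totals[t] = counters[t * stride];
        return totals;
    }

    static void Main()
    {
        int threads = Environment.ProcessorCount;
        const int iterations = 50000000;

        // stride 1: counters are adjacent; PaddedStride: counters are spread out.
        foreach (int stride in new int[] { 1, PaddedStride })
        {
            Stopwatch sw = Stopwatch.StartNew();
            Run(threads, stride, iterations);
            Console.WriteLine("stride {0}: {1} ms", stride, sw.ElapsedMilliseconds);
        }
    }
}
```

If the adjacent layout is clearly slower than the padded one, and the gap widens with more threads, that is consistent with false sharing. Timings vary with hardware and JIT optimizations, so treat the result as a hint rather than proof.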

粉红×色少女 2024-08-24 08:31:34

A 40% total speedup implies that either your algorithm is memory bandwidth constrained, or you are doing far too much synchronization. A profiler can help in each case.

Each call to wait for more data to process is expensive. Ideally, the amount of CPU time spent waiting for new data or performing the synchronized locks/unlocks is tiny. The simple way of ensuring this is making your processing payloads as "large" as possible.

As for the slowdown on your production system - profile it. There are numerous variables here.
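The advice about making payloads "large" could be sketched as follows. This is an illustration under assumptions: `ChunkedWork`, `Evaluate`, and `SumInChunks` are hypothetical names, and `Evaluate` is a stand-in for one unit of brute force work. Instead of queueing every item separately, each thread gets one contiguous chunk, accumulates into a local variable, and publishes a single result at the end, so almost no time is spent in synchronization.

```csharp
using System;
using System.Threading;

static class ChunkedWork
{
    // Hypothetical stand-in for one unit of brute force work.
    static long Evaluate(int candidate)
    {
        return (long)candidate * candidate;
    }

    // Splits [0, itemCount) into one contiguous chunk per thread.
    public static long SumInChunks(int itemCount, int threadCount)
    {
        long[] partial = new long[threadCount];
        Thread[] threads = new Thread[threadCount];
        int chunk = (itemCount + threadCount - 1) / threadCount; // ceiling division

        for (int t = 0; t < threadCount; t++)
        {
            int index = t;
            int start = Math.Min(itemCount, t * chunk);
            int end = Math.Min(itemCount, start + chunk);

            threads[t] = new Thread(() =>
            {
                long local = 0; // thread-private accumulator, no locking needed
                for (int i = start; i < end; i++)
                    local += Evaluate(i);
                partial[index] = local; // one shared write per thread
            });
            threads[t].Start();
        }

        long total = 0;
        for (int t = 0; t < threadCount; t++)
        {
            threads[t].Join(); // Join makes the worker's write visible here
            total += partial[t];
        }
        return total;
    }

    static void Main()
    {
        Console.WriteLine(SumInChunks(1000, Environment.ProcessorCount));
    }
}
```

The trade-off is load balance: if some candidates take much longer than others, fixed chunks can leave threads idle, and a queue of medium-sized batches may be a better compromise.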
