How are you taking advantage of multicore?

Published 2024-07-10 09:40:19 · 1,338 words · 8 views · 0 comments


As someone in the world of HPC who came from the world of enterprise web development, I'm always curious to see how developers back in the "real world" are taking advantage of parallel computing. This is much more relevant now that all chips are going multicore, and it'll be even more relevant when there are thousands of cores on a chip instead of just a few.

My questions are:

  1. How does this affect your software roadmap?
  2. I'm particularly interested in real stories about how multicore is affecting different software domains, so specify what kind of development you do in your answer (e.g. server side, client-side apps, scientific computing, etc).
  3. What are you doing with your existing code to take advantage of multicore machines, and what challenges have you faced? Are you using OpenMP, Erlang, Haskell, CUDA, TBB, UPC or something else?
  4. What do you plan to do as concurrency levels continue to increase, and how will you deal with hundreds or thousands of cores?
  5. If your domain doesn't easily benefit from parallel computation, then explaining why is interesting, too.

Finally, I've framed this as a multicore question, but feel free to talk about other types of parallel computing. If you're porting part of your app to use MapReduce, or if MPI on large clusters is the paradigm for you, then definitely mention that, too.

Update: If you do answer #5, mention whether you think things will change if there get to be more cores (100, 1000, etc) than you can feed with available memory bandwidth (seeing as how bandwidth is getting smaller and smaller per core). Can you still use the remaining cores for your application?


Comments (22)

酒几许 2024-07-17 09:40:19


My research work includes work on compilers and on spam filtering. I also do a lot of 'personal productivity' Unix stuff. Plus I write and use software to administer classes that I teach, which includes grading, testing student code, tracking grades, and myriad other trivia.

  1. Multicore affects me not at all except as a research problem for compilers to support other applications. But those problems lie primarily in the run-time system, not the compiler.
  2. At great trouble and expense, Dave Wortman showed around 1990 that you could parallelize a compiler to keep four processors busy. Nobody I know has ever repeated the experiment. Most compilers are fast enough to run single-threaded. And it's much easier to run your sequential compiler on several different source files in parallel than it is to make your compiler itself parallel. For spam filtering, learning is an inherently sequential process. And even an older machine can learn hundreds of messages a second, so even a large corpus can be learned in under a minute. Again, training is fast enough.
  3. The only significant way I have of exploiting parallel machines is using parallel make. It is a great boon, and big builds are easy to parallelize. Make does almost all the work automatically. The only other thing I can remember is using parallelism to time long-running student code by farming it out to a bunch of lab machines, which I could do in good conscience because I was only clobbering a single core per machine, so using only 1/4 of CPU resources. Oh, and I wrote a Lua script that will use all 4 cores when ripping MP3 files with lame. That script was a lot of work to get right.
  4. I will ignore tens, hundreds, and thousands of cores. The first time I was told "parallel machines are coming; you must get ready" was 1984. It was true then and is true today that parallel programming is a domain for highly skilled specialists. The only thing that has changed is that today manufacturers are forcing us to pay for parallel hardware whether we want it or not. But just because the hardware is paid for doesn't mean it's free to use. The programming models are awful, and making the thread/mutex model work, let alone perform well, is an expensive job even if the hardware is free. I expect most programmers to ignore parallelism and quietly get on about their business. When a skilled specialist comes along with a parallel make or a great computer game, I will quietly applaud and make use of their efforts. If I want performance for my own apps I will concentrate on reducing memory allocations and ignore parallelism.
  5. Parallelism is really hard. Most domains are hard to parallelize. A widely reusable exception like parallel make is cause for much rejoicing.

Summary (which I heard from a keynote speaker who works for a leading CPU manufacturer): the industry backed into multicore because they couldn't keep making machines run faster and hotter and they didn't know what to do with the extra transistors. Now they're desperate to find a way to make multicore profitable because if they don't have profits, they can't build the next generation of fab lines. The gravy train is over, and we might actually have to start paying attention to software costs.

Many people who are serious about parallelism are ignoring these toy 4-core or even 32-core machines in favor of GPUs with 128 processors or more. My guess is that the real action is going to be there.

甩你一脸翔 2024-07-17 09:40:19


For web applications it's very, very easy: ignore it. Unless you've got some code that really begs to be done in parallel you can simply write old-style single-threaded code and be happy.

You usually have a lot more requests to handle at any given moment than you have cores. And since each one is handled in its own Thread (or even process, depending on your technology) this is already working in parallel.

The only place you need to be careful is when accessing some kind of global state that requires synchronization. Keep that to a minimum to avoid introducing artificial bottlenecks to an otherwise (almost) perfectly scalable world.

So for me multi-core basically boils down to these items:

  • My servers have fewer "CPUs" while each one sports more cores (not much of a difference to me)
  • The same number of CPUs can sustain a larger number of concurrent users
  • When there seems to be a performance bottleneck that isn't the result of the CPU being 100% loaded, that's an indication that I'm doing some bad synchronization somewhere.
洒一地阳光 2024-07-17 09:40:19
  1. At the moment - doesn't affect it that much, to be honest. I'm more in 'preparation stage', learning about the technologies and language features that make this possible.
  2. I don't have one particular domain, but I've encountered domains like math (where multi-core is essential), data sort/search (where divide & conquer on multi-core is helpful) and multi-computer requirements (e.g., a requirement that a back-up station's processing power is used for something).
  3. This depends on what language I'm working in. Obviously in C#, my hands are tied with a not-yet-ready implementation of Parallel Extensions that does seem to boost performance, until you start comparing the same algorithms with OpenMP (perhaps not a fair comparison). So on .NET it's going to be an easy ride with some for → Parallel.For refactorings and the like.
    Where things get really interesting is with C++, because the performance you can squeeze out of things like OpenMP is staggering compared to .NET. In fact, OpenMP surprised me a lot, because I didn't expect it to work so efficiently. Well, I guess its developers have had a lot of time to polish it. I also like that it is available in Visual Studio out-of-the-box, unlike TBB for which you have to pay.
    As for MPI, I use PureMPI.net for little home projects (I have a LAN) to fool around with computations that one machine can't quite take. I've never used MPI commercially, but I do know that MKL has some MPI-optimized functions, which might be interesting to look at for anyone needing them.
  4. I plan to do 'frivolous computing', i.e. use extra cores for precomputation of results that might or might not be needed - RAM permitting, of course. I also intend to delve into costly algorithms and approaches that most end users' machines right now cannot handle.
  5. As for domains not benefiting from parallelization... well, one can always find something. One thing I am concerned about is decent support in .NET, though regrettably I have given up hope that speeds similar to C++ can be attained.
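
The for → Parallel.For refactoring mentioned in point 3 can be illustrated with a minimal, hypothetical sketch (in Python rather than C#; the `heavy` loop body is invented for the example):

```python
from concurrent.futures import ThreadPoolExecutor

def heavy(x):
    # Stand-in for a CPU-bound loop body.
    return sum(i * i for i in range(x))

def sequential(xs):
    # The plain for-loop version.
    return [heavy(x) for x in xs]

def parallel(xs, workers=4):
    # The for -> parallel-map refactoring (analogous to Parallel.For):
    # same loop body, mapped across a worker pool. In CPython, prefer
    # ProcessPoolExecutor for CPU-bound work, since the GIL limits
    # thread-level parallelism.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(heavy, xs))
```

The refactoring only applies when the iterations are independent, which is exactly why it is an "easy ride" for some loops and impossible for others.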
倒数 2024-07-17 09:40:19


I work in medical imaging and image processing.

We're handling multiple cores in much the same way we handled single cores-- we have multiple threads already in the applications we write in order to have a responsive UI.

However, because we can now, we're taking strong looks at implementing most of our image processing operations in either CUDA or OpenMP. The Intel Compiler provides a lot of good sample code for OpenMP, which is just a much more mature product than CUDA with a much larger installed base, so we're probably going to go with that.

What we tend to do for expensive (ie, more than a second) operations is to fork that operation off into another process, if we can. That way, the main UI remains responsive. If we can't, or it's just far too inconvenient or slow to move that much memory around, the operation is still in a thread, and then that operation can itself spawn multiple threads.

The key for us is to make sure that we don't hit concurrency bottlenecks. We develop in .NET, which means that UI updates have to be done from an Invoke call to the UI in order to have the main thread update the UI.

Maybe I'm lazy, but really, I don't want to have to spend too much time figuring a lot of this stuff out when it comes to parallelizing things like matrix inversions and the like. A lot of really smart people have spent a lot of time making that stuff fast like nitrous, and I just want to take what they've done and call it. Something like CUDA has an interesting interface for image processing (of course, that's what it's defined for), but it's still too immature for that kind of plug-and-play programming. If I or another developer get a lot of spare time, we might give it a try. So instead, we'll just go with OpenMP to make our processing faster (and that's definitely on the development roadmap for the next few months).

向日葵 2024-07-17 09:40:19


So far, nothing more than more efficient compilation with make:

gmake -j

the -j option allows tasks that don't depend on one another to run in parallel.

最终幸福 2024-07-17 09:40:19


I'm developing ASP.NET web applications. There is little possibility to use multicore directly in my code, however IIS already scales well for multiple cores/CPU's by spawning multiple worker threads/processes when under load.

还在原地等你 2024-07-17 09:40:19


We're having a lot of success with task parallelism in .NET 4 using F#. Our customers are crying out for multicore support because they don't want their n-1 cores idle!

↙温凉少女 2024-07-17 09:40:19


I'm in image processing. We're taking advantage of multicore where possible by processing images in slices doled out to different threads.
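
A minimal sketch of that slice-per-thread approach (hypothetical Python, not the poster's actual code): split the image into row ranges and process each range on its own thread.

```python
from concurrent.futures import ThreadPoolExecutor

def process_slice(rows):
    # Stand-in for a per-slice image operation (here, doubling with a
    # saturation cap, like a crude brightness adjustment).
    return [[min(255, px * 2) for px in row] for row in rows]

def process_image(image, workers=4):
    # Dole the image out in horizontal slices, roughly one per worker.
    n = max(1, len(image) // workers)
    slices = [image[i:i + n] for i in range(0, len(image), n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        out = []
        for chunk in pool.map(process_slice, slices):  # order is preserved
            out.extend(chunk)
    return out
```

This works because per-pixel operations on disjoint slices have no data dependencies; filters with neighborhoods need overlapping slices, which this toy version ignores.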

空袭的梦i 2024-07-17 09:40:19


I said some of this in answer to a different question (hope this is OK!): there is a concept/methodology called Flow-Based Programming (FBP) that has been around for over 30 years, and is being used to handle most of the batch processing at a major Canadian bank. It has thread-based implementations in Java and C#, although earlier implementations were fiber-based (C++ and mainframe Assembler). Most approaches to the problem of taking advantage of multicore involve trying to take a conventional single-threaded program and figure out which parts can run in parallel. FBP takes a different approach: the application is designed from the start in terms of multiple "black-box" components running asynchronously (think of a manufacturing assembly line). Since the interface between components is data streams, FBP is essentially language-independent, and therefore supports mixed-language applications, and domain-specific languages. Applications written this way have been found to be much more maintainable than conventional, single-threaded applications, and often take less elapsed time, even on single-core machines.
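
A toy illustration of the FBP idea — asynchronous "black-box" components connected by data streams — might look like this in Python (all names here are invented for the sketch; real FBP implementations such as the Java and C# ones mentioned above are far richer):

```python
import threading
import queue

DONE = object()  # end-of-stream marker flowing through the network

def component(fn, inq, outq):
    """A 'black box' that transforms its input stream, running asynchronously."""
    def run():
        while (item := inq.get()) is not DONE:
            outq.put(fn(item))
        outq.put(DONE)           # propagate end-of-stream downstream
    t = threading.Thread(target=run)
    t.start()
    return t

def run_network(items, *fns):
    # Wire components into a pipeline with queues (the FBP "connections").
    qs = [queue.Queue() for _ in range(len(fns) + 1)]
    threads = [component(fn, qs[i], qs[i + 1]) for i, fn in enumerate(fns)]
    for item in items:
        qs[0].put(item)
    qs[0].put(DONE)
    out = []
    while (item := qs[-1].get()) is not DONE:
        out.append(item)
    for t in threads:
        t.join()
    return out
```

Because the components only communicate through the queues, each one can run on its own core with no shared mutable state — the property that makes FBP applications parallel "for free".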

半﹌身腐败 2024-07-17 09:40:19


My graduate work is in developing concepts for doing bare-metal multicore work & teaching same in embedded systems.

I'm also working a bit with F# to bring my high-level multiprocessing-capable language facilities up to speed.

旧人九事 2024-07-17 09:40:19


We created the VivaMP code analyzer for detecting errors in parallel OpenMP programs.

VivaMP is a lint-like static C/C++ code analyzer meant to indicate errors in parallel programs based on OpenMP technology. The VivaMP static analyzer adds much to the abilities of existing compilers, diagnosing parallel code that contains errors or is a likely source of such errors. The analyzer is integrated into the Visual Studio 2005/2008 development environment.

VivaMP – a tool for OpenMP

32 OpenMP Traps For C++ Developers

离旧人 2024-07-17 09:40:19


I believe that "Cycles are an engineers' best friend".

My company provides a commercial tool for analyzing
and transforming very
large software systems in many computer languages.
"Large" means 10-30 million lines of code.
The tool is the DMS Software Reengineering Toolkit
(DMS for short).

Analyses (and even transformations) on such huge systems
take a long time: our points-to analyzer for C
code takes 90 CPU hours on an x86-64 with 16 Gb RAM.
Engineers want answers faster than that.

Consequently, we implemented DMS in PARLANSE,
a parallel programming language of our own design,
intended to harness small-scale multicore shared
memory systems.

The key ideas behind parlanse are:
a) let the programmer expose parallelism,
b) let the compiler choose which part it can realize,
c) keep the context switching to an absolute minimum.
Static partial orders over computations are
an easy way to help achieve all 3; easy to say,
relatively easy to measure costs,
easy for compiler to schedule computations.
(Writing parallel quicksort with this is trivial).

Unfortunately, we did this in 1996 :-(
The last few years have finally been a vindication;
I can now get 8 core machines at Fry's for under $1K
and 24 core machines for about the same price as a small
car (and likely to drop rapidly).

The good news is that DMS is now fairly mature,
and there are a number of key internal mechanisms
in DMS which take advantage of this, notably
an entire class of analyzers called "attribute grammars",
which we write using a domain-specific language
which is NOT parlanse. DMS compiles these
attribute grammars into PARLANSE and then they
are executed in parallel. Our C++ front
end uses attribute grammars, and is about 100K
sloc; it is compiled into 800K SLOC of parallel
parlanse code that actually works reliably.

Now (June 2009), we are pretty busy making DMS useful, and
don't always have enough time to harness the parallelism
well. Thus the 90 hour points-to analysis.
We are working on parallelizing that, and
have reasonable hope of 10-20x speedup.

We believe that in the long run, harnessing
SMP well will make workstations far more
friendly to engineers asking hard questions.
As well they should.
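
The claim that "writing parallel quicksort with this is trivial" can be illustrated with a rough sketch in Python (not PARLANSE; the explicit pool and depth limit are assumptions of this toy version). The recursive calls on the two halves have no data dependency on each other — a static partial order the scheduler is free to exploit:

```python
from concurrent.futures import ThreadPoolExecutor

def pquicksort(xs, pool=None, depth=2):
    """Quicksort where the two independent halves may run in parallel."""
    if len(xs) <= 1:
        return list(xs)
    pivot = xs[len(xs) // 2]
    lo = [x for x in xs if x < pivot]
    eq = [x for x in xs if x == pivot]
    hi = [x for x in xs if x > pivot]
    if pool is not None and depth > 0:
        # Fork one half onto the pool; the depth limit bounds the total
        # number of forked tasks so that workers waiting on fut.result()
        # cannot exhaust the pool and deadlock.
        fut = pool.submit(pquicksort, lo, pool, depth - 1)
        right = pquicksort(hi, pool, depth - 1)
        return fut.result() + eq + right
    return pquicksort(lo) + eq + pquicksort(hi)
```

In a language like PARLANSE the compiler does this partitioning and scheduling from the exposed partial order; here it must be hand-managed, which is part of the answer's point about why parallelism stays a specialist's job.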

一江春梦 2024-07-17 09:40:19


Our domain logic is based heavily on a workflow engine and each workflow instance runs off the ThreadPool.

That's good enough for us.

彻夜缠绵 2024-07-17 09:40:19


I can now separate my main operating system from my development OS / install whatever OS I like using virtualisation setups with Virtual PC or VMWare.

Dual core means that one CPU runs my host OS, the other runs my development OS with a decent level of performance.

风蛊 2024-07-17 09:40:19


Learning a functional programming language might let you use multiple cores... costly.

I think it's not really hard to use extra cores. There are some trivial cases, such as web apps, that don't need any extra care since the web server does its work running the queries in parallel. The questions are for long-running algorithms (long being whatever you call long). These need to be split into smaller domains that don't depend on each other, or the dependencies must be synchronized. A lot of algorithms can do this, but sometimes horribly different implementations are needed (costs again).

So, sorry, there is no silver bullet as long as you are using imperative programming languages. Either you need skilled programmers (costly) or you need to turn to another programming language (costly). Or you may simply be lucky (the web case).

过期以后 2024-07-17 09:40:19


I'm using and programming on a Mac. Grand Central Dispatch for the win. The Ars Technica review of Snow Leopard has a lot of interesting things to say about multicore programming and where people (or at least Apple) are going with it.

別甾虛僞 2024-07-17 09:40:19


I've decided to take advantage of multiple cores in an implementation of the DEFLATE algorithm. Mark Adler did something similar in C code with PIGZ (parallel gzip). I've delivered the philosophical equivalent, but in a managed code library, in DotNetZip v1.9. This is not a port of PIGZ, but a similar idea, implemented independently.

The idea behind DEFLATE is to scan a block of data, look for repeated sequences, build a "dictionary" that maps a short "code" to each of those repeated sequences, then emit a byte stream where each instance of one of the repeated sequences is replaced by a "code" from the dictionary.

Because building the dictionary is CPU intensive, DEFLATE is a perfect candidate for parallelization. I've taken a Map+Reduce type approach, where I divide the incoming uncompressed byte stream into a set of smaller blocks (map), say 64k each, and then compress those independently. Then I concatenate the resulting blocks together (reduce). Each 64k block is compressed independently, on its own thread, without regard for the other blocks.

On a dual-core machine, this approach compresses in about 54% of the time of the traditional serial approach. On server-class machines, with more cores available, it can potentially deliver even better results; not having a server machine, I haven't tested it personally, but people tell me it's fast.


There's runtime (CPU) overhead associated with the management of multiple threads, runtime memory overhead associated with the buffers for each thread, and data overhead associated with concatenating the blocks. So this approach pays off only for larger byte streams. In my tests, above 512k, it can pay off. Below that, it is better to use a serial approach.


DotNetZip is delivered as a library. My goal was to make all of this transparent. So the library automatically uses the extra threads when the buffer is above 512kb. There's nothing the application has to do in order to use threads. It just works, and when threads are used, it's magically faster. I think this is a reasonable approach to take for most libraries being consumed by applications.


It would be nice for the computer to be smart about automatically and dynamically exploiting resources on parallelizable algorithms, but the reality today is that app designers have to explicitly code the parallelization in.
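
The map+reduce scheme described above can be sketched with Python's zlib, using the same trick pigz relies on (an illustrative analogue, not DotNetZip's code): each block is raw-DEFLATE-compressed on its own thread, and every non-final block ends with a full flush so that the concatenation decompresses as one valid stream.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def _compress_block(args):
    data, is_last = args
    co = zlib.compressobj(level=6, wbits=-15)  # raw DEFLATE, no zlib header
    out = co.compress(data)
    # A full flush ends on a byte boundary without setting the final-block
    # bit, so independently compressed blocks can be concatenated; only the
    # last block finishes the stream.
    out += co.flush(zlib.Z_FINISH if is_last else zlib.Z_FULL_FLUSH)
    return out

def parallel_deflate(data, block_size=64 * 1024, workers=4):
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    jobs = [(b, i == len(blocks) - 1) for i, b in enumerate(blocks)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map preserves order, which is the "reduce" (concatenation) step.
        return b"".join(pool.map(_compress_block, jobs))
```

Note the trade-off the answer describes: each block starts with an empty dictionary, so the output is slightly larger than a serial compression of the whole stream — the price paid for independence between blocks. (CPython's zlib also releases the GIL, so threads genuinely run in parallel here.)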


-黛色若梦 2024-07-17 09:40:19


I work in C# with .Net Threads.
You can combine object-oriented encapsulation with Thread management.

I've read some posts from Peter talking about a new book from Packt Publishing and I've found the following article in Packt Publishing web page:

http://www.packtpub.com/article/simplifying-parallelism-complexity-c-sharp

I've read Concurrent Programming with Windows, Joe Duffy's book. Now, I am waiting for "C# 2008 and 2005 Threaded Programming", Hillar's book - http://www.amazon.com/2008-2005-Threaded-Programming-Beginners/dp/1847197108/ref=pd_rhf_p_t_2

I agree with Szundi "No silver bullet"!

樱花坊 2024-07-17 09:40:19


You say "For web applications it's very, very easy: ignore it. Unless you've got some code that really begs to be done in parallel you can simply write old-style single-threaded code and be happy."

I am working with Web applications and I do need to take full advantage of parallelism.
I understand your point. However, we must prepare for the multicore revolution. Ignoring it is the same as ignoring the GUI revolution in the '90s.

We aren't still developing for DOS, are we? We must tackle multicore or we'll be dead in a few years.

说不完的你爱 2024-07-17 09:40:19


I think this trend will first persuade some developers, and then most of them will see that parallelization is a really complex task.
I expect some design patterns to come along to take care of this complexity. Not low-level ones, but architectural patterns which will make it hard to do something wrong.

For example I expect messaging patterns to gain popularity, because it's inherently asynchronous, but you don't think about deadlock or mutex or whatever.

软甜啾 2024-07-17 09:40:19

  1. How does this affect your software roadmap?
    It doesn't. Our (as with almost all other) business related apps run perfectly well on a single core. So long as adding more cores doesn't significantly reduce the performance of single threaded apps, we're happy

  2. ...real stories...
    Like everyone else, parallel builds are the main benefit we get. The Visual Studio 2008 C# compiler doesn't seem to use more than one core though, which really sucks

  3. What are you doing with your existing code to take advantage of multicore machines
    We may look into using the .NET parallel extensions if we ever have a long-running algorithm that can be parallelized, but the odds of this actually occurring are slim. The most likely answer is that some of the developers will play around with it for interest's sake, but not much else

  4. how will you deal with hundreds or thousands of cores?
    Head -> Sand.

  5. If your domain doesn't easily benefit from parallel computation, then explaining why is interesting, too.
    The client app mostly pushes data around, the server app mostly relies on SQL server to do the heavy lifting

漫雪独思 2024-07-17 09:40:19


I'm taking advantage of multicore using C, PThreads, and a home brew implementation of Communicating Sequential Processes on an OpenVPX platform with Linux using the PREEMPT_RT patch set's scheduler. It all adds up to nearly 100% CPU utilisation across multiple OS instances with no CPU time used for data exchange between processor cards in the OpenVPX chassis, and very low latency too. Also using sFPDP to join multiple OpenVPX chassis together into a single machine. I'm not using Xeon's internal DMA so as to relieve memory pressure inside CPUs (DMA still uses memory bandwidth at the expense of the CPU cores). Instead we're leaving data in place and passing ownership of it around in a CSP way (so not unlike the philosophy of .NET's task parallel data flow library).

1) Software Roadmap - we have pressure to maximise the use of real estate and available power. Making the very most of the latest hardware is essential

2) Software domain - effectively Scientific Computing

3) What we're doing with existing code? Constantly breaking it apart and redistributing parts of it across threads so that each core is maxed out doing the most it possibly can without breaking our real-time requirement. New hardware means quite a lot of re-thinking (faster cores can do more in the given time; we don't want them to be underutilised). Not as bad as it sounds - the core routines are very modular so are easily assembled into thread-sized lumps. Although we planned on taking control of thread affinity away from Linux, we've not yet managed to extract significant extra performance by doing so. Linux is pretty good at getting data and code in more or less the same place.

4) In effect already there - total machine already adds up to thousands of cores

5) Parallel computing is essential - it's a MISD system.

If that sounds like a lot of work, it is. Some jobs require going whole hog on making the absolute most of available hardware and eschewing almost everything that is high level. We're finding that the total machine performance is a function of CPU memory bandwidth, not CPU core speed or L1/L2/L3 cache size.
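
The "leave data in place and pass ownership around" idea can be sketched in Python (a toy analogue, not the poster's C/PThreads system; the stage functions are invented for the example). Each stage mutates a buffer only while it holds it, then hands it downstream over a channel — no locks guard the data itself:

```python
import threading
import queue

def stage(inq, outq, work):
    """A CSP-style pipeline stage.

    Between get() and put() this stage is the buffer's sole owner, so it may
    mutate it in place without locks; after put() it must stop touching it,
    because ownership has moved downstream with the message.
    """
    def run():
        while (buf := inq.get()) is not None:
            work(buf)        # mutate in place: we are the exclusive owner
            outq.put(buf)    # hand ownership downstream
        outq.put(None)       # propagate shutdown
    t = threading.Thread(target=run)
    t.start()
    return t

def double_all(buf):
    buf[:] = [x * 2 for x in buf]

def append_total(buf):
    buf.append(sum(buf))
```

Avoiding copies is the point: the data never moves, only the right to touch it does, which is why the answer can report near-100% CPU utilisation with no time lost to data exchange.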
