I'm doing some research on multicore processors; specifically I'm looking at writing code for multicore processors and also compiling code for multicore processors.
I'm curious about the major problems in this field that would currently prevent a widespread adoption of programming techniques and practices to fully leverage the power of multicore architectures.
I am aware of the following efforts (some of these don't seem directly related to multicore architectures, but seem to have more to do with parallel-programming models, multi-threading, and concurrency):
- Erlang (I know that Erlang includes constructs to facilitate concurrency, but I am not sure how exactly it is being leveraged for multicore architectures)
- OpenMP (seems mostly related to multiprocessing and leveraging the power of clusters)
- Unified Parallel C
- Cilk
- Intel Threading Building Blocks (this seems to be directly related to multicore systems; that makes sense, as it comes from Intel. In addition to defining certain programming constructs, it also seems to have features that tell the compiler to optimize the code for multicore architectures)
In general, from what little experience I have with multithreaded programming, I know that programming with concurrency and parallelism in mind is definitely a difficult concept. I am also aware that multithreaded programming and multicore programming are two different things. In multithreaded programming you are ensuring that the CPU does not remain idle (on a single-CPU system); as far as I know, you cannot truly perform parallel operations there. (As James pointed out, the OS can schedule different threads to run on different cores -- but I'm more interested in describing the parallel operations from the language itself, or via the compiler.) In multicore systems, you should be able to perform truly parallel operations.
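To make the distinction concrete, here is a minimal sketch in Java of what I mean by expressing parallelism at the language/library level (the class name `ParallelSum` and the per-core chunking scheme are just illustrative): the code only declares the available parallelism, and the OS still decides which core each worker thread actually runs on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelSum {
    // Sum 0..n-1 by splitting the range into one chunk per available core.
    // On a multicore machine the chunks can run in true parallel; on a
    // single-CPU machine the same code merely interleaves.
    static long sum(long n) {
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        long chunk = (n + cores - 1) / cores;
        List<Future<Long>> parts = new ArrayList<>();
        for (long lo = 0; lo < n; lo += chunk) {
            final long start = lo, end = Math.min(lo + chunk, n);
            parts.add(pool.submit(() -> {
                long s = 0;
                for (long i = start; i < end; i++) s += i;
                return s;
            }));
        }
        long total = 0;
        try {
            for (Future<Long> f : parts) total += f.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(1_000_000));  // 499999500000
    }
}
```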
So it seems to me that currently the problems facing multicore programming are:
- Multicore programming is a difficult concept that requires significant skill
- There are no native constructs in today's programming languages that provide a good abstraction to program for a multicore environment
- Other than Intel's TBB library, I haven't found efforts in other programming languages to leverage the power of multicore architectures during compilation (for example, I don't know whether the Java or C# compiler optimizes bytecode for multicore systems, or whether the JIT compiler does)
I'm interested in knowing what other problems there might be, and if there are any solutions in the works to address these problems. Links to research papers (and things of that nature) would be helpful. Thanks!
EDIT
If I had to condense my question down to one sentence, it would be this: What are the problems that face multicore programming today and what research is going on in the field to solve these problems?
UPDATE
It also seems to me that there are three levels at which multicore needs to be addressed:
- Language level: Constructs/concepts/frameworks that abstract parallelization and concurrency and make it easy for programmers to express them
- Compiler level: If the compiler is aware of what architecture it is compiling for, it can optimize the compiled code for that architecture.
- OS level: The OS optimizes the running process and perhaps schedules different threads/processes to run on different cores.
I've searched on ACM and IEEE and have found a few papers. Most of them talk about how difficult it is to think concurrently and also how current languages don't have a proper way to express concurrency. Some have gone so far as to claim that the current model of concurrency that we have (threads) is not a good way to handle concurrency (even on multiple cores). I'm interested in hearing other views.
Inertia. (BTW: that's pretty much the answer to all "what does prevent the widespread adoption" questions, whether that be models of parallel programming, garbage collection, type safety or fuel-efficient automobiles.)
We have known since the 1960s that the threads+locks model is fundamentally broken. By ~1980, we had about a dozen better models. And yet, the vast majority of languages that are in use today (including languages that were newly created from scratch long after 1980), offer only threads+locks.
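For illustration, one family of those better models is message passing (Erlang-style actors, CSP). A minimal, hypothetical Java sketch -- the `Mailbox` class and its sentinel protocol are invented for this example -- in which the worker thread owns its state outright and user code takes no locks:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class Mailbox {
    // Instead of sharing mutable state under locks, the worker keeps its
    // state thread-local and communicates only via messages on a queue.
    // Sentinel protocol (illustrative only): values must be >= 0, and a
    // negative message means "no more input, reply with the total".
    public static long sumViaMessages(int[] values) {
        BlockingQueue<Integer> inbox = new ArrayBlockingQueue<>(values.length + 1);
        BlockingQueue<Long> replies = new ArrayBlockingQueue<>(1);
        Thread worker = new Thread(() -> {
            long total = 0;  // local to this thread; no lock needed
            try {
                while (true) {
                    int msg = inbox.take();
                    if (msg < 0) break;  // sentinel: end of input
                    total += msg;
                }
                replies.put(total);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        try {
            for (int v : values) inbox.put(v);
            inbox.put(-1);
            return replies.take();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sumViaMessages(new int[]{1, 2, 3, 4}));  // 10
    }
}
```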
The major problems with multicore programming are the same as with writing any other concurrent application. But whereas it used to be uncommon to have multiple CPUs in a computer, it is now hard to find any modern computer with only one core, so taking advantage of multicore, multi-CPU architectures presents new challenges.
But this problem is an old one: whenever computer architectures get ahead of compilers, the fallback solution seems to be a move back toward functional programming, since that paradigm, if strictly followed, can produce very parallelizable programs -- you don't have any global mutable variables, for example.
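As a small illustration of why the functional style parallelizes well, here is a hedged Java sketch (class and method names are invented for this example): a pure function touches no shared mutable state, so the runtime is free to spread its calls across cores in any order.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class PureMap {
    // A pure function: its result depends only on its argument, so calls
    // can run in any order, on any core, without synchronization.
    static int square(int x) {
        return x * x;
    }

    static int[] squaresInParallel(int n) {
        // parallel() lets the runtime fan the pure calls out across cores;
        // toArray() still preserves the encounter order of the range.
        return IntStream.range(0, n).parallel().map(PureMap::square).toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(squaresInParallel(5)));  // [0, 1, 4, 9, 16]
    }
}
```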
But not all problems can be solved easily using FP, so the goal is to make other programming paradigms easy to use on multicores as well.
The first issue is that many programmers have avoided writing good multithreaded applications, so there isn't a large pool of well-prepared developers; the habits they have learned will make this kind of coding harder for them.
But, as with most changes to the cpu, you can look at how to change the compiler, and for that you can look at Scala, Haskell, Erlang and F#.
For libraries, you can look at Microsoft's Parallel Extensions for the .NET Framework as a way to make concurrent programming easier.
This is ongoing work, but IEEE Spectrum or IEEE Computer recently ran articles on multicore programming issues, so look at what IEEE and ACM articles have been written on these topics to get more ideas about what is being researched.
I think the biggest impediment will be the difficulty of getting programmers to change their language, as FP is very different from OOP.
Besides developing languages that work well this way, one area of research is how to handle multiple threads accessing memory; as with much in this area, Haskell seems to be at the forefront of testing such ideas, so you can look at what is going on with Haskell.
Ultimately there will be new languages, and we may get DSLs to provide more abstraction for developers, but how to educate programmers about all this will be a challenge.
UPDATE:
You may find Chapter 24, "Concurrent and multicore programming", of interest: http://book.realworldhaskell.org/read/concurrent-and-multicore-programming.html
One of the answers mentioned the Parallel Extensions for the .NET Framework, and since you mentioned C#, it's definitely something I would investigate. Microsoft has done some interesting things there, though I have to think many of their efforts seem better suited to language enhancements in C# than to a separate and distinct library for concurrent programming. But I think their efforts are worth applauding, with the respect that we're still early here. (Disclaimer: I used to be the marketing director for Visual Studio about 3 years ago.)
Intel's Threading Building Blocks are also quite interesting (Intel recently released a new version, and I'm excited to head down to the Intel Developer Forum next week to learn more about how to use it properly).
Lastly, I work for Corensic, a software quality startup in Seattle. We've got a tool called Jinx that is designed to detect concurrency errors in your code. A 30-day trial edition is available for Windows and Linux, so you might want to check it out. (www.corensic.com)
In a nutshell, Jinx is a very thin hypervisor that, when activated, slips in between the processor and operating system. Jinx then intelligently takes slices of execution and runs simulations of various thread timings to look for bugs. When we find a particular thread timing that will cause a bug to happen, we make that timing "reality" on your machine (e.g., if you're using Visual Studio, the debugger will stop at that point). We then point out the area in your code where the bug was caused. There are no false positives with Jinx. When it detects a bug, it's definitely a bug.
Jinx works on Linux and Windows, and in both native and managed code. It is language and application platform agnostic and can work with all your existing tools.
If you check it out, please send us feedback on what works and doesn't work. We've been running Jinx on some big open source projects and already are seeing situations where Jinx can find bugs 50-100 times faster than simply stress testing code.
The bottleneck of any high-performance application (written in C or C++) designed to make efficient use of more than one processor/core is the memory system (caches and RAM). A single core usually saturates the memory system with its reads and writes, so it is easy to see why adding extra cores and threads can cause an application to run slower. If a queue of people can pass through a door one at a time, adding extra queues will not only clog the door but also make the passage of any one individual through it less efficient.
The key to any multicore application is optimizing and economizing on memory accesses. This means structuring data and code to work as much as possible inside their own caches, where they don't disturb the other cores with accesses to the common cache (L3) or RAM. Once in a while a core needs to venture out there, but the trick is to reduce those situations as much as possible. In particular, data needs to be structured around and adapted to cache lines and their size (currently 64 bytes), and code needs to be compact and not call and jump all over the place, which also disrupts the pipelines.
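One concrete instance of the cache-line concern is false sharing: two counters written by different threads that happen to land on the same 64-byte line ping-pong that line between cores. A hedged Java sketch of the common manual-padding heuristic (the Java spec does not guarantee field layout, so real code would prefer the JDK's `@Contended` annotation; class and field names here are invented):

```java
public class Padded {
    // Two counters updated by different threads. If they shared a cache
    // line, each write would invalidate the other core's copy (false
    // sharing). The long fields pad the counter toward its own line.
    static class PaddedCounter {
        volatile long value;
        long p1, p2, p3, p4, p5, p6, p7;  // padding: layout is a JVM heuristic
    }

    // Each thread writes only its own counter, so the result is exact;
    // the padding affects speed, not correctness.
    static long[] run(long iters) {
        PaddedCounter a = new PaddedCounter(), b = new PaddedCounter();
        Thread t1 = new Thread(() -> { for (long i = 0; i < iters; i++) a.value++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < iters; i++) b.value++; });
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return new long[]{a.value, b.value};
    }

    public static void main(String[] args) {
        long[] r = run(1_000_000);
        System.out.println(r[0] + " " + r[1]);  // 1000000 1000000
    }
}
```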
My experience is that efficient solutions are unique to the application in question. The generic guidelines above are a basis on which to construct code, but the tweaks that follow from profiling results will not be obvious to those who were not themselves involved in the optimization work.
Look up fork/join frameworks and work-stealing runtimes. These are two names for the same, or at least related, approach: recursively subdividing large tasks into lightweight units so that all available parallelism is exploited, without having to know in advance how much parallelism there is. The idea is that it should run at serial speed on a uniprocessor but get a linear speedup with multiple cores.
Sort of a horizontal analogue of cache-oblivious algorithms if you look at it right.
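As a concrete sketch of the pattern, Java's `ForkJoinPool` is one real work-stealing runtime (the class name `FjSum` and the threshold value are just illustrative): split until chunks are small, fork one half so an idle worker can steal it, and compute the other half yourself.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class FjSum extends RecursiveTask<Long> {
    // Recursively split [lo, hi) until chunks are small, then sum serially.
    // The work-stealing pool keeps idle cores busy with forked subtasks.
    static final int THRESHOLD = 1_000;  // arbitrary cutoff for this sketch
    final long[] data;
    final int lo, hi;

    FjSum(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override protected Long compute() {
        if (hi - lo <= THRESHOLD) {  // small enough: run at serial speed
            long s = 0;
            for (int i = lo; i < hi; i++) s += data[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        FjSum left = new FjSum(data, lo, mid);
        FjSum right = new FjSum(data, mid, hi);
        left.fork();                           // left half may be stolen by an idle worker
        return right.compute() + left.join();  // compute the right half ourselves
    }

    public static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new FjSum(data, 0, data.length));
    }

    public static void main(String[] args) {
        long[] a = new long[10_000];
        for (int i = 0; i < a.length; i++) a[i] = i;
        System.out.println(sum(a));  // 49995000
    }
}
```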
But I'd say the main problem facing multicore programming is that the great majority of computations remain stubbornly serial. There's just no way to throw multiple cores at those computations and make them stick.