Why not a hybrid of out-of-order cores and massive parallelism?

Posted 2024-11-02 14:21:12


It seems in vogue to predict that superscalar out-of-order CPUs are going the way of the dodo and will be replaced by huge amounts of simple, scalar, in-order cores. This doesn't seem to be happening in practice because, even if the problem of parallelizing software were solved tomorrow, there's still tons of legacy software out there. Besides, parallelizing software is not a trivial problem.

I understand that GPGPU is a hybrid model, where the CPU is designed for single-thread performance and the graphics card for parallelism, but it's an ugly one. The programmer needs to explicitly rewrite code to run on the graphics card, and to the best of my understanding expressing parallelism efficiently for a graphics card is much harder than expressing it efficiently for a multicore general-purpose CPU.

What's wrong with a hybrid model where every PC comes with one or two "expensive" superscalar out-of-order cores and 32 or 64 "cheap" cores, but with the same instruction set as the expensive cores and possibly on the same piece of silicon? The operating system would be aware of this asymmetry and would schedule the highest-priority threads onto the out-of-order cores first. This prioritization might even be explicitly exposed to the programmer via the OS API, but the programmer wouldn't be forced to care about the distinction unless he/she wanted to control the details of the scheduling.
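To make the idea concrete, here is a minimal sketch of what such a hint could look like from the programmer's side, assuming Linux and using CPU affinity as the closest existing analogue of that OS API (the core numbering, with CPU 0 as the big out-of-order core and CPUs 2-5 as small cores, is just an assumption for illustration):

```cpp
// Sketch only: compile with  g++ -std=c++17 -pthread
#include <pthread.h>   // pthread_setaffinity_np (GNU extension)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <thread>
#include <vector>

// Restrict a thread to the given CPUs; the OS still does the scheduling.
static void pin_to_cpus(std::thread& t, const std::vector<int>& cpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c : cpus) CPU_SET(c, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    std::thread critical([] { /* latency-sensitive, mostly serial work */ });
    std::thread background([] { /* throughput-oriented parallel work */ });

    pin_to_cpus(critical, {0});            // hint: the big out-of-order core
    pin_to_cpus(background, {2, 3, 4, 5}); // hint: any small core will do

    critical.join();
    background.join();
}
```

The point is that a thread which doesn't care simply never calls the hint, exactly as described above.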

Edit: If the vote to close is because this supposedly isn't programming related, here's a rebuttal: I think it is programming-related because I want to hear programmers' perspective on why such a model is a good or bad idea and whether they would want to program to it.


Comments (5)

陪你到最终 2024-11-09 14:21:12


W00t, what a great question =D

At a glance, I can see two problems. I should note that from here on I'll be considering a CPU-bound parallel application when laying out my arguments.

The first is the control overhead imposed on the operating system. Remember, the OS is responsible for dispatching processes to the CPUs they will run on. Moreover, the OS needs to control concurrent access to the data structures that hold this information. So you hit the first bottleneck right there, in having the OS abstract away the scheduling of tasks. This is already a drawback.

The following is a nice experiment. Try to write an application that makes heavy use of the CPU. Then, with some other tool, such as atsar, collect statistics on user and system time. Now vary the number of concurrent threads and watch what happens to the system time. Plotting the data may help you see how the (not so =) useless processing grows.
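A minimal sketch of that experiment, assuming Linux and getrusage() for the user/system-time split (the 5-second busy-spin duration and the thread-count argument are arbitrary choices of mine):

```cpp
// Compile with  g++ -std=c++17 -pthread ; run with the thread count as argv[1].
#include <sys/resource.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    const int nthreads = argc > 1 ? std::atoi(argv[1]) : 4;
    std::atomic<bool> stop{false};

    // N threads of pure user-mode busy work.
    std::vector<std::thread> pool;
    for (int i = 0; i < nthreads; ++i)
        pool.emplace_back([&stop] {
            volatile double x = 1.0;
            while (!stop.load(std::memory_order_relaxed)) x = x * 1.000001;
        });

    std::this_thread::sleep_for(std::chrono::seconds(5));
    stop = true;
    for (auto& t : pool) t.join();

    // Compare time spent in user code vs. time spent in the kernel on our behalf.
    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);
    std::printf("threads=%d user=%ld.%06ld s sys=%ld.%06ld s\n", nthreads,
                (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec,
                (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
}
```

Running it with 1, 2, 4, 8, ... threads and plotting the system-time column is exactly the curve described above.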

Second, as you add cores to your system, you also need a more powerful bus. CPU cores need to exchange data with memory so that computation can be done, so with more cores you will have more concurrent accesses to the bus. One might argue that a system with more than one bus could be designed. Yes, indeed, such a system can be designed. However, extra mechanisms must be in place to keep the integrity of the data used by the cores. Some mechanisms do exist at the cache level, but they are very, very expensive to deploy at the main-memory level.

Keep in mind that every time a thread changes some data in memory, the change must be propagated to the other threads when they access that data, an operation that is common in parallel applications (mainly numerical ones).
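As a rough illustration of that propagation cost (this micro-benchmark is my own, not part of the answer above), compare per-thread counters that share a cache line with counters padded onto separate lines; the first variant forces coherence traffic on every single update:

```cpp
// Compile with  g++ -std=c++17 -pthread  and time each run() call separately.
#include <atomic>
#include <thread>
#include <vector>

constexpr int  kThreads = 8;
constexpr long kIters   = 10'000'000;

struct Packed { std::atomic<long> v{0}; };              // neighbours share a cache line
struct alignas(64) Padded { std::atomic<long> v{0}; };  // one cache line per counter

template <typename Slot>
void run() {
    std::vector<Slot> slots(kThreads);
    std::vector<std::thread> pool;
    for (int i = 0; i < kThreads; ++i)
        pool.emplace_back([&slots, i] {
            for (long n = 0; n < kIters; ++n)
                slots[i].v.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : pool) t.join();
}

int main() {
    run<Packed>();  // every increment invalidates the line in the other cores' caches
    run<Padded>();  // same work, but each core keeps its own line: far less traffic
}
```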

Nevertheless, I do agree with your position that the current models are ugly. And yes, nowadays it is much more difficult to express parallelism in GPGPU programming models, since the programmer is entirely responsible for moving bits around. I anxiously hope for more succinct, high-level and standardized abstractions for many-core and GPGPU application development.

小霸王臭丫头 2024-11-09 14:21:12


@aikigaeshi thanks for mentioning my paper. I am Yale Patt's student and actually the first author of the paper titled "Accelerating Critical Sections". We came up with the idea after a lot of studies. In fact, I recently gave a major talk at an industry conference about this. Here is my take after studying it for several years:

@dsimcha, the question should be split into two parts for a good analysis.

  1. Do we need the ability to run some code faster than the rest? This question leads to a simpler one: is some code more critical than the rest? I define critical code as any code for which threads contend. When a thread is waiting for a piece of code to finish executing, that code clearly becomes more critical, because speeding it up not only speeds up the thread currently executing it but also the threads waiting for it to finish. Single-threaded kernels are a great example, where all threads wait for a single thread to finish. Critical sections are another example, where every thread wanting to enter the critical section must wait for any previous thread to finish it. Running such critical code faster is clearly a good idea, because when code is being contended for, it inherently becomes more performance-critical. There are other scenarios, such as reductions, load imbalance and lone-thread problems, that can lead to critical code, and executing this code faster can help. So I strongly conclude that there is a need for what I call performance asymmetry.

  2. How can we provide performance asymmetry? Having big and small cores together in the same system is one way of providing it. While this is the architecture I explored, a lot of research remains to be done on other ways of providing asymmetry: frequency scaling, prioritizing memory requests from critical threads, and giving more resources to the critical thread are all possibilities. Coming back to the big/small core architecture: my research found it to be feasible in most cases, because the overhead of migrating tasks to the big core was offset by the benefit obtained from accelerating the critical code. I will skip the details, but there are some very interesting trade-offs; I encourage you to read my papers or my PhD thesis for them. (A rough software analogue of the idea is sketched right after this list.)
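The sketch below is my own software analogue of point 2, not the hardware mechanism from the paper (the class name and API are invented for illustration): worker threads ship their critical sections to a single "server" thread, which the OS could keep pinned to the big core, so contended code always runs on the fastest core available.

```cpp
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

class CriticalSectionServer {
public:
    CriticalSectionServer() : worker_([this] { loop(); }) {}
    ~CriticalSectionServer() {
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
        worker_.join();
    }

    // Called from any thread; blocks until the section has run on the server thread.
    void execute(std::function<void()> section) {
        std::packaged_task<void()> task(std::move(section));
        std::future<void> finished = task.get_future();
        {
            std::lock_guard<std::mutex> lk(m_);
            queue_.push(std::move(task));
        }
        cv_.notify_one();
        finished.wait();
    }

private:
    void loop() {  // runs on the thread that would be pinned to the big core
        std::unique_lock<std::mutex> lk(m_);
        while (true) {
            cv_.wait(lk, [this] { return done_ || !queue_.empty(); });
            if (done_ && queue_.empty()) return;
            auto task = std::move(queue_.front());
            queue_.pop();
            lk.unlock();
            task();      // the critical section itself, executed serially
            lk.lock();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::packaged_task<void()>> queue_;
    bool done_ = false;
    std::thread worker_;  // declared last so the other members exist before it starts
};
```

Whether this pays off is exactly the trade-off mentioned above: shipping the work to the server thread has to cost less than the speedup gained by running the contended code on the faster core.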

I also want to point out a few major facts.
- I was able to leverage this asymmetric chip (the ACMP) without modifying the software programs, which is some evidence that it will not increase the work for application programmers.
- I did not find the OS work to be challenging. I implemented a runtime all by myself in a couple of weeks, and it worked great for my studies. I understand that there is a fear in the OS community of changing OSes, and I appreciate the value of engineering resources; however, I disagree that OS changes should be a limiter. Those are problems that will be overcome with time.

- After years of writing parallel programs, studying existing programs, studying processor designs, and working at major companies, I am actually convinced that the ACMP will help programmers. In the current model, programmers write a parallel program, identify the serial bottleneck, hammer on it until it is parallelized, and then move on to the next bottleneck. In general, bottlenecks become harder and harder to tackle and diminishing returns kick in. If the hardware provided some ability to run the bottleneck faster -- magically -- then programmers would not have to waste so much time chasing parallel performance. They could parallelize the easier-to-parallelize code and leave the rest to the hardware.

故笙诉离歌 2024-11-09 14:21:12


Your post reads more like a hypothesis than an enquiry. This topic is known as heterogeneous architectures and is currently a lively research area. You can find interesting workshops and keynotes on hetero strategies at industry conferences.

http://scholar.google.com/scholar?q=heterogeneous+architectures&hl=en&btnG=Search&as_sdt=1%2C5&as_sdtp=on

"What's wrong with a hybrid model where every PC comes with one or two 'expensive' superscalar out-of-order cores and 32 or 64 'cheap' cores, but with the same instruction set as the expensive cores and possibly on the same piece of silicon?"

There's nothing "wrong" with it, but there are numerous practical difficulties. For example, you mention scheduling by thread priority, but this is only one of many metrics needed to make smart scheduling decisions. What if your highest priority thread is a data streaming app that makes very poor use of the big core caches? Would your net system performance increase to schedule this streaming app on a small core?

且行且努力 2024-11-09 14:21:12


Your idea sounds much like AMD's plans for Fusion. AMD is integrating a GPU onto the CPU. Right now, this is for their low-power, slower designs intended to compete with Intel's Atom, but they are moving it up into laptop chips.

I believe the rumors are pretty reliable that AMD's Bulldozer design for server chips will be using Fusion in a couple of years, possibly entirely replacing the Bulldozer floating point units.

These GPU units do not use the same instruction set, but consider that, with the GPU built into the CPU, the compiler itself is free to use it just as if it were any other type of MMX/SSE-style vector instruction.

A possible example is a loop doing math on a C++ vector of floating-point numbers. The compiler, with optimizations set to AMD-Whatever, could emit machine code to pin the vector's memory, invoke a GPU program and wait for the results.

This is only a bit more complicated than what the auto-vectorization optimizations for SSE already do: they load the data into an XMM register, do the operation and split the data back out of the register.
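As a rough illustration of that analogy (the function names and the intrinsics version are my own; a compiler's auto-vectorizer would generate something similar on its own), here is a scalar loop over a vector of floats next to an explicit SSE version that loads four floats into an XMM register, operates on them, and stores them back:

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>
#include <vector>

// What the programmer writes.
void scale_scalar(std::vector<float>& v, float k) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] *= k;
}

// Roughly what the vectorizer turns it into.
void scale_sse(std::vector<float>& v, float k) {
    const __m128 kk = _mm_set1_ps(k);
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        __m128 x = _mm_loadu_ps(&v[i]);   // load 4 floats into an XMM register
        x = _mm_mul_ps(x, kk);            // one vector multiply
        _mm_storeu_ps(&v[i], x);          // store 4 results back
    }
    for (; i < v.size(); ++i) v[i] *= k;  // scalar tail
}
```

On a Fusion-style part, the same compiler decision could instead target the on-die GPU for large enough loops.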

地狱即天堂 2024-11-09 14:21:12


A lot of the big architecture people would actually agree with you that heterogeneous architectures show a lot of promise. I saw a talk by Yale Patt the other day in which he took this position, predicting that the next generation of successful architectures would consist of a few large, fast cores supplemented by many smaller ones. One group used this idea to mitigate the overheads of concurrency by providing a bigger core to which threads executing critical sections can be migrated.
