为什么不采用乱序混合、大规模并行？

发布于 2024-11-02 14:21:12 字数 528 浏览 5 评论 0 原文

人们似乎流行预测超标量无序 CPU 将会重蹈渡渡鸟的覆辙，并将被大量简单、标量、有序的核心所取代。这在实践中似乎并没有发生，因为即使明天解决了并行化软件的问题，仍然存在大量的遗留软件。此外，并行化软件并不是一个小问题。

我知道 GPGPU 是一种混合模型，其中 CPU 是为单线程性能而设计的，显卡是为并行性而设计的，但它是一个丑陋的模型。程序员需要显式地重写代码才能在显卡上运行，据我所知，为显卡有效地表达并行性比为多核通用 CPU 有效地表达并行性要困难得多。

混合模型有什么问题，其中每台 PC 都配备一两个“昂贵”的超标量乱序核心和 32 或 64 个“廉价”核心，但具有与昂贵核心相同的指令集，并且可能位于同一块处理器上。硅？操作系统会意识到这种不对称性，并会首先调度无序的核心并使用最高优先级的线程。这种优先级甚至可以通过操作系统 API 显式地暴露给程序员，但程序员不会被迫关心这种区别，除非他/她想要控制调度的细节。

编辑：如果投票结束是因为这与编程无关，那么这里有一个反驳：我认为它与编程相关，因为我想听听程序员对为什么这样的模型是好还是坏的想法以及他们是否想要对其进行编程。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

陪你到最终 2024-11-09 14:21:12

W00t，多么好的问题=D

乍一看，我可以看到两个问题。我建议您，现在在公开我的论点时，我将考虑 CPU 限制的并行应用程序。

第一个是对操作系统施加的控制开销。请记住，操作系统负责将进程分派到它们将运行的 CPU。此外，操作系统需要控制对保存此信息的数据结构的并发访问。因此，您遇到了让操作系统抽象任务计划的第一个瓶颈。这已经是一个缺点了。

下面是一个很好的实验。尝试编写一个大量使用 CPU 的应用程序。然后，使用其他应用程序（例如 atsar）获取用户和系统时间的统计信息。现在，改变并发线程的数量并查看系统时间会发生什么情况。绘制数据图可能有助于计算（不是那么=）无用处理的增长。

其次，当您向系统添加内核时，您还需要更强大的总线。 CPU 核心需要与内存交换数据，以便完成计算。因此，有了更多的核心，您将有更多的总线并发访问。有人可能会争辩说，可以设计一个具有多个总线的系统。是的，确实可以设计这样的系统。然而，必须采取额外的机制来保持内核所使用的数据的完整性。某些机制确实存在于缓存级别，但是部署在主内存级别的成本非常非常高。

请记住，每次线程更改内存中的某些数据时，当其他线程访问此数据时，此更改必须传播到其他线程，这是并行应用程序（主要是数字应用程序）中常见的操作。

尽管如此，我确实同意你的立场，即当前的模型很丑陋。是的，现在在 GPGPU 编程模型中表达并行性要困难得多，因为程序员完全负责移动位。我迫切希望为众核和GPGPU应用程序开发提供更简洁、更高层次和标准化的抽象。

W00t, what a great question =D

In a glance, I can see two problems. I advice you that for now on I'll be considering a CPU bounded parallel application when exposing my arguments.

The first one is the control overhead imposed to the operating system. Remember, the OS is responsible for dispatching the processes to the CPU they will run on. Moreover, the OS needs to control the concurrent access to the data structures that holds this information. Thus, you got the first bottleneck of having the OS abstracting the schedule of tasks. This is already a drawback.

The following is a nice experiment. Try to write an application that makes a lot of use of the CPU. Then, with some other application, like atsar, get the statics of user and system time. Now, vary the number of concurrent threads and look to what happens to the system time. Plotting data may help to figure the growth of (not so =) useless processing.

Second, as you add cores to your system, you also need a more powerful bus. CPU cores need to exchange data with the memory so a computation may be done. Thus, with more cores, you'll have more concurrent access to the bus. Someone may argue that a system with more than one bus can be designed. Yes, indeed, such a system may be designed. However, extra mechanisms must be in place to keep the integrity of the data used by the cores. Some mechanism do exist at cache-level, however they are very very expensive to be deployed in the primary memory-level.

Keep in mind that every time a thread changes some data in the memory, this change must be propagated to others threads when they access this data, an action that is usual in parallel applications (mainly in numerical ones).

Nevertheless, I do agree with your position that the current models are ugly. And yes, nowadays is much more difficult to express parallelism in GPGPU programming models as the programmer is totally responsible for moving bits around. I anxiously hope for more succinct, high-level and standardized abstraction for many-core and GPGPU application development.

回复收藏 0 原文

小霸王臭丫头 2024-11-09 14:21:12

@aikigaeshi 感谢您提及我的论文。我是耶鲁帕特的学生，实际上是题为“加速关键部分”的论文的第一作者。经过大量研究后，我们提出了这个想法。事实上，我最近在一次行业会议上就此发表了重要演讲。这是我研究了几年后的看法：

@dsimcha，这个问题应该分为两部分以便进行良好的分析。

我们是否需要能够比其他代码更快地运行某些代码？这个问题引出了一个更简单的问题：是否某些代码比其他代码更关键。我将关键代码定义为线程内容的任何代码。当一个线程争夺一段代码来完成执行时，该线程正在等待的代码显然变得更加关键，因为加速该代码不仅可以加快当前执行该代码的线程的速度，还可以加快等待当前线程的线程的速度来完成执行。单线程内核是一个很好的例子，其中所有线程都等待单个线程完成。临界区是另一个示例，其中所有想要进入临界区的线程必须等待任何先前的线程完成临界区。更快地运行此类关键代码显然是一个好主意，因为当代码被竞争时，它本质上变得对性能更加关键。还有其他场景，例如减少、负载不平衡、借用线程问题，可能会导致关键代码，更快地执行此代码会有所帮助。因此，我强烈得出结论，需要所谓的性能不对称。
我们如何提供性能不对称性？将大核和小核放在同一系统中是提供这种不对称性的一种方法。虽然这是我探索的架构，但还应该进行大量研究来探索提供不对称性的其他方法。频率缩放、对来自关键线程的内存请求进行优先级排序、为关键线程提供更多资源，都是提供不对称性的可能方法。回到大小核心架构：我的研究发现它在大多数情况下都是可行的，因为将任务迁移到大核心的开销被加速关键代码所获得的好处所抵消。我会跳过细节，但有一些非常有趣的权衡。我鼓励您阅读我的论文或我的博士论文以了解详细信息。

我还想指出几个主要事实。
-我能够在不修改软件程序的情况下利用这种非对称芯片（ACMP），这证明它不会增加应用程序员的工作量
-我不认为操作系统工作具有挑战性。我在几周内自己实现了一个运行时，这对我的学习很有帮助。我理解操作系统社区担心更改操作系统，并且我欣赏工程资源的价值，但是，我不同意操作系统更改应该成为限制因素。它的问题将随着时间的推移而被克服。

-经过一年的编写并行程序、研究现有程序、研究处理器设计以及在大公司工作，我实际上相信 ACMP 确实会对程序员有所帮助。在当前模型中，程序员编写并行程序，然后识别串行瓶颈，然后对其进行锤炼，直到其并行化，然后继续处理下一个瓶颈。一般来说，瓶颈变得越来越难解决，并且收益递减。如果硬件提供了某种更快地运行瓶颈的能力——神奇的是——那么程序员就不必浪费那么多时间来获得并行性能。他们可以并行化更容易并行化的代码，并将其余部分留在硬件上。

@aikigaeshi thanks for mentioning my paper. I am Yale Patt's student and actually the first author of the paper titled accelerating critical sections. We came up with the idea after a lot of studies. In fact, I recently gave a major talk at an industry conference about this. Here is my take after studying it for several years:

@dsimcha , the question should be split into two parts for a good analysis.

Do we need need the ability to run some code faster than the rest? This question then leads to a simpler question: is some code more critical than the rest. I defined a critical code as any code for which threads content. When a thread contends for a pice of code to finish execution, the code the thread is waiting for clearly becomes more critical because speeding up that code not only speeds up the thread executing the code currently but also speeds up the thread waiting for the current thread to finish the execution. Single-threaded kernels is a great examples where all threads wait for a single thread to finish. Critical sections is another example where all threads wanting to enter the critical section must wait for any previous thread to finish the critical section. Running such critical code faster is clearly a good idea because when a code is being contended-for, it inherently becomes more performance-critical. There are other scenarios like reductions, load imbalance, loaner thread problems, that can lead to critical code and executing this code faster can help. So I strongly conclude that there is need for what I call performance asymmetry.
How can we provide provide performance asymmetry? Having big and small cores together in the same system is one way of providing this asymmetry. While this is the architecture I explored, a lot of research should be done in exploring other ways to provide asymmetry. Frequency scaling, prioritizing memory requests from the critical threads, giving more resources to the critical thread, are all possible ways of providing asymmetry. Coming back to big and small core architecture: My research found it to be feasible in most cases as the overhead of migrating the tasks to the big core was offset by the benefit you obtained from accelerating the critical code. I would skip the details but there are some very interesting trade-offs. I encourage you to read my papers or my PhD thesis for the detail.

I also want to point out a few major facts.
-I was able to leverage this asymmetric chip (ACMP) without modifying the software programs so thats kind of proof that it will not increase the work for application programmers
-I did not find the OS work to be challenging. I implemented a run-time all by myself in a couple of weeks which worked great for my studies. I understand that there is his fear in the Os community to change OSes and I appreciate the value of engineering resources, however, I disagree that OS changes should be a limiter. Its problems that will be overcome with time.

-After year of writing parallel programs, studying existing programs, studying processor designs, and working at major companies, I am actually convinced that the ACMP actually will help the programmers. In the current model, programmers write a parallel program and then identify the serial bottleneck and then hammer on it until its parallelized and then move on to the next bottleneck. Im general, bottlenecks become harder and harder to tackle and diminishing returns kicks in. If the hardware provided some ability to run the bottleneck faster --magically-- then the programmers would not have to waste so much time to get parallel performance. They could parallelize the easier-to-parallelize code and leave the rest on hardware.

回复收藏 0 原文