What is the difference between multi-CPU, multi-core, and hyper-threading?

Posted 2024-07-16 14:35:49

Could anyone explain to me the differences between multi-CPU, multi-core, and hyper-thread? I am always confused about these differences, and about the pros/cons of each architecture in different scenarios.

Here is my current understanding after learning online and learning from others' comments.

  1. I think hyper-threading is the most limited technology of the three, but it's cheap. Its main idea is to duplicate registers to save context-switch time;
  2. Multi-processor is better than hyper-threading, but since the different CPUs sit on different chips, communication between CPUs has longer latency than with multi-core, and using multiple chips costs more and consumes more power than multi-core;
  3. Multi-core integrates all the CPUs on a single chip, so the latency of communication between CPUs is greatly reduced compared with multi-processor. Since it uses one single chip to contain all the CPUs, it consumes less power and is less expensive than a multi-processor system.

Is this correct?

Comments (4)

╭⌒浅淡时光〆 2024-07-23 14:35:49

Multi-CPU was the first version: You'd have one or more mainboards with one or more CPU chips on them. The main problem here was that the CPUs would have to expose some of their internal data to the other CPUs so they wouldn't get in each other's way.

The next step was hyper-threading. One chip on the mainboard, but with some parts duplicated internally, so it could execute instructions from two threads at the same time.

The current development is multi-core. It's basically the original idea (several complete CPUs) but in a single chip. The advantage: Chip designers can easily put the additional wires for the sync signals into the chip (instead of having to route them out on a pin, then over the crowded mainboard and up into a second chip).

Supercomputers today are multi-CPU, multi-core: They have lots of mainboards, usually with 2-4 CPUs on each; each CPU is multi-core, and each has its own RAM.

[EDIT] You got that pretty much right. Just a few minor points:

  • Hyper-threading keeps track of two contexts at once in a single core, exposing more parallelism to the out-of-order CPU core. This keeps the execution units fed with work, even when one thread is stalled on a cache miss, branch mispredict, or waiting for results from high-latency instructions. It's a way to get more total throughput without replicating much hardware, but if anything it slows down each thread individually. See this Q&A for more details, and an explanation of what was wrong with the previous wording of this paragraph.

  • The main problem with multi-CPU is that code running on them will eventually access the RAM. There are N CPUs but only one bus to access the RAM. So you must have some hardware which makes sure that a) each CPU gets a fair amount of RAM access, b) accesses to the same part of the RAM don't cause problems, and c) most importantly, CPU 2 is notified when CPU 1 writes to some memory address which CPU 2 has in its internal cache. If that doesn't happen, CPU 2 will happily use the cached value, oblivious to the fact that it is outdated.

    Just imagine you have tasks in a list and you want to spread them to all available CPUs. So CPU 1 will fetch the first element from the list and update the pointers. CPU 2 will do the same. For efficiency reasons, both CPUs will not only copy the few bytes into the cache but a whole "cache line" (whatever that may be). The assumption is that, when you read byte X, you'll soon read X+1, too.

    Now both CPUs have a copy of the memory in their cache. CPU 1 will then fetch the next item from the list. Without cache sync, it won't have noticed that CPU 2 has changed the list, too, and it will start to work on the same item as CPU 2.

    This is what effectively makes multi-CPU so complicated. Side effects of this can lead to performance that is worse than what you'd get if the whole code ran only on a single CPU. The solution was multi-core: You can easily add as many wires as you need to synchronize the caches; you could even copy data from one cache to another (updating parts of a cache line without having to flush and reload it), etc. Or the cache logic could make sure that all CPUs get the same cache line when they access the same part of real RAM, simply blocking CPU 2 for a few nanoseconds until CPU 1 has made its changes. (The sketch below shows what this shared-list race looks like in code.)
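To make the shared-list race concrete, here is a minimal C11 sketch (my addition, not part of the original answer): two threads claim items from a shared list, and an atomic fetch-and-add (which the hardware implements with exactly the kind of cache synchronization described above) guarantees each item is claimed exactly once. With a plain int and `next_item++` instead, you get the double-processing bug described above.

```c
/* Minimal sketch of the shared work-list problem: two threads claim
 * items from a shared list. C11 atomics make the claim safe.
 * Compile with: cc -std=c11 -pthread race.c */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

#define NITEMS 8

static int tasks[NITEMS] = {0, 1, 2, 3, 4, 5, 6, 7};

/* The shared "pointer to the next item". With a plain int and next++,
 * both CPUs could read the same value from their caches and process
 * the same task twice: the race described above. */
static atomic_int next_item = 0;

static void *worker(void *arg)
{
    const char *name = arg;
    for (;;) {
        /* Atomically claim one index; the coherence protocol guarantees
         * each index is handed out exactly once across all cores. */
        int i = atomic_fetch_add(&next_item, 1);
        if (i >= NITEMS)
            break;
        printf("%s claimed task %d\n", name, tasks[i]);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, "CPU 1");
    pthread_create(&t2, NULL, worker, "CPU 2");
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```

Run it and every task prints exactly once, however the two threads happen to interleave.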

[EDIT2] The main reason why multi-core is simpler than multi-CPU is that on a mainboard, you simply can't run all the wires between the two chips which you'd need to make sync effective. Plus a signal only travels 30 cm/ns tops (speed of light; in a wire, you usually have much less); at that rate, a 15 cm trace alone costs about half a nanosecond each way, more than a full cycle on a 3 GHz CPU. And don't forget that, on a multi-layer mainboard, signals start to influence each other (crosstalk). We like to think that 0 is 0V and 1 is 5V but in reality, "0" is something between -0.5V (overdrive when dropping a line from 1->0) and 0.5V, and "1" is anything above 0.8V.

If you have everything inside of a single chip, signals run much faster and you can have as many as you like (well, almost :). Also, signal crosstalk is much easier to control.

冷清清 2024-07-23 14:35:49

You can find some interesting articles about dual CPU, multi-core and hyper-threading on Intel's website or in a short article from Yale University.

I hope you find all the information you need there.

-残月青衣踏尘吟 2024-07-23 14:35:49

In a nutshell: a multi-CPU or multi-processor system has several processors. A multi-core system is a multi-processor system with several processors on the same die. With hyper-threading, multiple threads can run on the same processor (that is, the context-switch time between these threads is very small).
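To see how the OS presents these to a program, here is a minimal C sketch (my addition, assuming a system with glibc; these sysconf names are glibc extensions): it prints the number of logical processors, which on a hyper-threaded machine counts each hyperthread, not each physical core.

```c
/* Minimal sketch: query the number of logical processors the OS exposes.
 * On a 2-core machine with 2 hyperthreads per core this prints 4. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long online = sysconf(_SC_NPROCESSORS_ONLN);     /* currently usable */
    long configured = sysconf(_SC_NPROCESSORS_CONF); /* total configured */
    printf("online: %ld, configured: %ld\n", online, configured);
    return 0;
}
```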

Multi-processors have been around for 30 years now, but mostly in labs. Multi-core is the new popular multi-processor. Server processors nowadays implement hyper-threading along with multiple processors.

The Wikipedia articles on these topics are quite illustrative.

划一舟意中人 2024-07-23 14:35:49

Hyperthreading is a cheaper and slower alternative to having multiple cores

The Intel Manual Volume 3 System Programming Guide (325384-056US, September 2015), section 8.7 "INTEL HYPER-THREADING TECHNOLOGY ARCHITECTURE", describes HT briefly. It contains the following diagram:

[Diagram: Intel Hyper-Threading Technology architecture, from the Intel manual]

TODO: how much slower is it, percentage-wise, on average in real applications?

Hyperthreading is possible because modern single CPU cores already execute multiple instructions at once through the instruction pipeline: https://en.wikipedia.org/wiki/Instruction_pipelining

The instruction pipeline is a separation of functions inside a single core which ensures that each part of the circuit is in use at any given time: reading memory, decoding instructions, executing instructions, and so on.

Hyperthreading separates functions further by using:

  • a single backend, which actually runs the instructions with its pipeline.

    Dual core has two backends, which explains the greater cost and performance.

  • two front-ends, which take two streams of instructions and order them in a way that maximizes pipelining usage of the single backend by avoiding hazards.

    Dual core would also have 2 front-ends, one for each backend.

    There are edge cases where instruction reordering produces no benefit, making hyperthreading useless. But it produces a significant improvement on average.

Two hyperthreads in a single core share more cache levels (TODO: how many? L1?) than two different cores, which share only L3.

The interface that each hyperthread exposes to the operating system is similar to that of an actual core, and both can be controlled separately. Thus cat /proc/cpuinfo shows me 4 processors, even though I only have 2 cores with 2 hyperthreads each.
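To see the mapping for yourself, here is a minimal C sketch (my addition, assuming the standard Linux sysfs layout): it prints, for every logical CPU, which logical CPUs are hyperthread siblings on the same physical core.

```c
/* Minimal sketch: list hyperthread siblings via
 * /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.
 * Compile with: cc siblings.c */
#include <stdio.h>

int main(void)
{
    char path[128], line[64];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof path,
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break; /* no such logical CPU: we've seen them all */
        if (fgets(line, sizeof line, f))
            /* e.g. "0,2" means logical CPUs 0 and 2 share one physical core */
            printf("cpu%d siblings: %s", cpu, line);
        fclose(f);
    }
    return 0;
}
```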

Operating systems can however take advantage of knowing which hyperthreads are on the same core to run multiple threads of a given program on a single core, which might improve cache usage.
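A program can do the same by pinning its threads to chosen logical CPUs. A sketch of that (my addition; pthread_setaffinity_np is a GNU extension, and logical CPU 0 is just a hypothetical choice, its sibling number would come from the sysfs listing above):

```c
/* Minimal sketch: pin the calling thread to logical CPU 0, e.g. to keep
 * two cooperating threads on sibling hyperthreads of the same core.
 * Compile with: cc -pthread pin.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set); /* logical CPU 0 (hypothetical choice) */

    int err = pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    if (err != 0) {
        fprintf(stderr, "pthread_setaffinity_np failed: %d\n", err);
        return 1;
    }
    printf("now pinned to logical CPU 0\n");
    return 0;
}
```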

This LinusTechTips video contains a light-hearted non-technical explanation: https://www.youtube.com/watch?v=wnS50lJicXc

Multi-CPU is a bit like multicore, but communication can only happen through RAM, not L3 cache

This means that, if possible, you want to partition the work so that tasks which use the same memory a lot run on the same CPU.
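A minimal sketch of that idea (my addition, assuming Linux with libnuma installed; node 0 is a hypothetical choice): bind a thread and its memory to the same NUMA node, so the task never reaches across to the other CPU's RAM.

```c
/* Minimal sketch: keep a task's memory on the same NUMA node (i.e. the
 * same CPU package) that the task runs on.
 * Compile with: cc numa_demo.c -lnuma */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int node = 0; /* hypothetical choice: pin work and memory to node 0 */

    numa_run_on_node(node); /* run this thread on node 0's CPUs only */
    double *buf = numa_alloc_onnode(1 << 20, node); /* 1 MiB on node 0's RAM */
    if (!buf)
        return 1;

    buf[0] = 42.0; /* all accesses are now node-local */
    printf("allocated and touched node-local memory on node %d\n", node);

    numa_free(buf, 1 << 20);
    return 0;
}
```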

E.g. the following SBI-7228R-T2X blade server contains 4 CPUs, 2 on each node:

[Photo: SBI-7228R-T2X blade server]

We can see that there seem to be 4 sockets for the CPUs, each covered by a heat sink, with one open.

I think that across nodes they don't even share RAM and must communicate through some kind of networking, representing one further step up the hyperthread/multicore/multi-CPU hierarchy (TODO: confirm).
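If the nodes really don't share RAM, the programming model on such machines shifts from shared memory to explicit message passing. A minimal sketch of that style (my addition, assuming an MPI implementation such as Open MPI is available):

```c
/* Minimal sketch: across nodes that share no RAM, processes exchange
 * data explicitly over the interconnect instead of through a coherent
 * cache. Compile with: mpicc hello.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int value = 42;
        /* no shared memory: the value travels over the network fabric */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```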
