What do the terms "CPU bound" and "I/O bound" mean?

What do the terms "CPU bound" and "I/O bound" mean?

Comments (11)

離人涙 2024-07-26 06:06:26

When your program is waiting for I/O (i.e. a disk read/write, network read/write, etc.), the CPU is free to do other tasks even though your program is stopped. The speed of your program will mostly depend on how fast that I/O can happen; if you want to speed it up, you will need to speed up the I/O.

If your program is running lots of program instructions and not waiting for I/O, then it is said to be CPU bound. Speeding up the CPU will make the program run faster.

In either case, the key to speeding up the program might not be to speed up the hardware, but to optimize the program to reduce the amount of I/O or CPU it needs, or to have it do I/O while it also does CPU-intensive work.
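
As a rough illustration of the two cases, here is a minimal Python sketch (time.sleep stands in for real I/O such as a disk or network read; the loop size and delay are arbitrary):

import time

def cpu_bound(n=10_000_000):
    # Spends its time executing instructions: a faster CPU finishes sooner.
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_bound(delay=1.0):
    # Spends its time waiting: a faster CPU would not help at all.
    time.sleep(delay)

for task in (cpu_bound, io_bound):
    start = time.perf_counter()
    task()
    print(f"{task.__name__}: {time.perf_counter() - start:.2f}s")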

旧情别恋 2024-07-26 06:06:26

  • I/O-bound processes: spend more time doing I/O than computation; many short CPU bursts.
  • CPU-bound processes: spend more time doing computation; few, very long CPU bursts.

紫﹏色ふ单纯 2024-07-26 06:06:26

See what Microsoft says.

The core of async programming is the Task and Task<T> objects, which
model asynchronous operations. They are supported by the async and
await keywords. The model is fairly simple in most cases:

  • For I/O-bound code, you await an operation which returns a Task or
    Task<T> inside of an async method.

  • For CPU-bound code, you await an operation which is started on a
    background thread with the Task.Run method.

The await keyword is where the magic happens. It yields control to the
caller of the method that performed await, and it ultimately allows a
UI to be responsive or a service to be elastic.

I/O-Bound Example: Downloading data from a web service

private readonly HttpClient _httpClient = new HttpClient();

downloadButton.Clicked += async (o, e) =>
{
    // This line will yield control to the UI as the request
    // from the web service is happening.
    //
    // The UI thread is now free to perform other work.
    var stringData = await _httpClient.GetStringAsync(URL);
    DoSomethingWithData(stringData);
};

CPU-bound Example: Performing a Calculation for a Game

private DamageResult CalculateDamageDone()
{
    // Code omitted:
    //
    // Does an expensive calculation and returns
    // the result of that calculation.
}

calculateButton.Clicked += async (o, e) =>
{
    // This line will yield control to the UI while CalculateDamageDone()
    // performs its work.  The UI thread is free to perform other work.
    var damageResult = await Task.Run(() => CalculateDamageDone());
    DisplayDamage(damageResult);
};

The examples above showed how you can use async and await for I/O-bound and CPU-bound work. It is key that you can identify when a job you need to do is I/O-bound or CPU-bound, because it can greatly affect the performance of your code and could potentially lead to misusing certain constructs.

Here are two questions you should ask before you write any code:

Will your code be "waiting" for something, such as data from a
database?

  • If your answer is "yes", then your work is I/O-bound.

Will your code be performing a very expensive computation?

  • If you answered "yes", then your work is CPU-bound.

If the work you have is I/O-bound, use async and await without
Task.Run. You should not use the Task Parallel Library. The reason for
this is outlined in the Async in Depth article.

If the work you have is CPU-bound and you care about responsiveness,
use async and await but spawn the work off on another thread with
Task.Run. If the work is appropriate for concurrency and parallelism,
you should also consider using the Task Parallel Library.

成熟稳重的好男人 2024-07-26 06:06:26

I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed.

This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.

不羁少年 2024-07-26 06:06:26

An application is CPU-bound when the arithmetic/logical/floating-point (A/L/FP) performance during the execution is mostly near the theoretical peak performance of the processor (data provided by the manufacturer and determined by the characteristics of the processor: number of cores, frequency, registers, ALUs, FPUs, etc.).

Peak performance is very difficult, if not impossible, to achieve in real-world applications. Most applications access memory in different parts of the execution, and the processor does no A/L/FP operations for several cycles at a time. This is called the von Neumann limitation, due to the distance that exists between the memory and the processor.

If you want to be near the CPU peak performance, a strategy could be to reuse most of the data in the cache memory in order to avoid requiring data from main memory. An algorithm that exploits this feature is matrix-matrix multiplication (if both matrices can be stored in the cache memory). This happens because if the matrices are of size n x n, then you need to do about 2 n^3 operations using only 2 n^2 FP numbers of data. On the other hand, matrix addition, for example, is a less CPU-bound (or more memory-bound) application than matrix multiplication, since it requires only n^2 FLOPs with the same data.

The following figure shows the FLOPs obtained with naive algorithms for matrix addition and matrix multiplication on an Intel i5-9300H:

FLOPs comparison between Matrix Addition and Matrix Multiplication

Note that, as expected, the performance of matrix multiplication is greater than that of matrix addition. These results can be reproduced by running test/gemm and test/matadd available in this repository.

I also suggest watching the video given by J. Dongarra about this effect.

瑶笙 2024-07-26 06:06:26

I/O-bound process: if most of a process's lifetime is spent in the I/O state, then it is an I/O-bound process. Examples: Calculator, Internet Explorer.

CPU-bound process: if most of a process's lifetime is spent in the CPU, then it is a CPU-bound process.

许仙没带伞 2024-07-26 06:06:25

It's pretty intuitive:

A program is CPU bound if it would go faster if the CPU were faster, i.e. it spends the majority of its time simply using the CPU (doing calculations). A program that computes new digits of π will typically be CPU-bound; it's just crunching numbers.

A program is I/O bound if it would go faster if the I/O subsystem were faster. Which exact I/O system is meant can vary; I typically associate it with the disk, but networking or communication in general is common too. A program that looks through a huge file for some data might become I/O bound, since the bottleneck is then the reading of the data from disk (actually, this example is perhaps kind of old-fashioned these days, with hundreds of MB/s coming in from SSDs).

無心 2024-07-26 06:06:25

CPU Bound means the rate at which process progresses is limited by the speed of the CPU. A task that performs calculations on a small set of numbers, for example multiplying small matrices, is likely to be CPU bound.

I/O Bound means the rate at which a process progresses is limited by the speed of the I/O subsystem. A task that processes data from disk, for example, counting the number of lines in a file is likely to be I/O bound.

Memory bound means the rate at which a process progresses is limited by the amount of memory available and the speed of that memory access. A task that processes large amounts of in-memory data, for example multiplying large matrices, is likely to be memory bound.

Cache bound means the rate at which a process progresses is limited by the amount and speed of the cache available. A task that simply processes more data than fits in the cache will be cache bound.

I/O bound would be slower than memory bound, which would be slower than cache bound, which would be slower than CPU bound.

The solution to being I/O bound isn't necessarily to get more memory. In some situations, the access algorithm could be designed around the I/O, memory, or cache limitations. See Cache-Oblivious Algorithms.
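
As a sketch of the first two categories in Python ("big.log" is a placeholder name for any large file; NumPy is assumed for the matrix part):

import time
import numpy as np

# I/O bound: counting lines in a large file; throughput is limited by the
# disk, not the CPU. "big.log" is a placeholder file name.
start = time.perf_counter()
with open("big.log", "rb") as f:
    lines = sum(chunk.count(b"\n")
                for chunk in iter(lambda: f.read(1 << 20), b""))
print("lines:", lines, "in", round(time.perf_counter() - start, 2), "s")

# CPU bound: repeatedly multiplying a small matrix that fits in cache;
# arithmetic dominates, so a faster CPU helps.
a = np.random.rand(200, 200)
start = time.perf_counter()
for _ in range(1000):
    _ = a @ a
print("matmul:", round(time.perf_counter() - start, 2), "s")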

苯莒 2024-07-26 06:06:25

Multi-threading is where it tends to matter the most

In this answer, I will investigate one important use case of distinguishing between CPU- and IO-bound work: writing multi-threaded code.

RAM I/O bound example: Vector Sum

Consider a program that sums all the values of a single vector:

#define SIZE 1000000000
unsigned int is[SIZE];
unsigned int sum = 0;
size_t i = 0;
for (i = 0; i < SIZE; i++)
    /* Each one of those requires a RAM access! */
    sum += is[i];

Parallelizing that by splitting the array equally for each of your cores is of limited usefulness on common modern desktops.

For example, on my Ubuntu 19.04, Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ CPU (4 cores / 8 threads), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB) I get results like this:

Plot data.

Note, however, that there is a lot of variance between runs. But I can't increase the array size much further, since I'm already at 8 GiB, and I'm not in the mood for statistics across multiple runs today. This did seem like a typical run after doing many manual runs.

Benchmark code:

I don't know enough about computer architecture to fully explain the shape of the curve, but one thing is clear: the computation does not become 8x faster as naively expected from my using all 8 threads! For some reason, 2 and 3 threads were the optimum, and adding more just makes things much slower.

Compare this to CPU bound work, which actually does get 8 times faster: What do 'real', 'user' and 'sys' mean in the output of time(1)?

The reason is that all processors share a single memory bus linking them to RAM:

CPU 1   --\    Bus    +-----+
CPU 2   ---\__________| RAM |
...     ---/          +-----+
CPU N   --/

so the memory bus quickly becomes the bottleneck, not the CPU.

This happens because adding two numbers takes a single CPU cycle, while memory reads take about 100 CPU cycles on 2016 hardware.

So the CPU work done per byte of input data is too small, and we call this an IO-bound process.

The only way to speed up that computation further would be to speed up individual memory accesses with new memory hardware, e.g. multi-channel memory.

Upgrading to a faster CPU clock for example would not be very useful.

Other examples

  • matrix multiplication is CPU-bound on RAM and GPUs. The input contains:

    2 * N**2
    

    numbers, but:

    N ** 3
    

    multiplications are done, and that is enough for parallelization to be worth it for practical large N.

    This is why parallel CPU matrix multiplication libraries exist.

    Cache usage makes a big difference to the speed of implementations. See for example this didactic GPU comparison example.

    See also:

  • Networking is the prototypical IO-bound example.

    Even when we send a single byte of data, it still takes a long time to reach its destination.

    Parallelizing small network requests like HTTP requests can offer huge performance gains.

    If the network is already at full capacity (e.g. downloading a torrent), parallelization can still improve latency (e.g. you can load a web page "at the same time").

  • A dummy C++ CPU bound operation that takes one number and crunches it a lot:

  • Sorting appears to be CPU bound, based on the following experiment: Are C++17 Parallel Algorithms implemented already?, which showed a 4x performance improvement for parallel sort, but I would like to have a more theoretical confirmation as well

  • The well-known Coremark benchmark from EEMBC explicitly checks how well a suite of problems scales. A sample benchmark result clearly shows this:

    Workload Name                                     (iter/s)   (iter/s)    Scaling
    ----------------------------------------------- ---------- ---------- ----------
    cjpeg-rose7-preset                                  526.32     178.57       2.95
    core                                                  7.39       2.16       3.42
    linear_alg-mid-100x100-sp                           684.93     238.10       2.88
    loops-all-mid-10k-sp                                 27.65       7.80       3.54
    nnet_test                                            32.79      10.57       3.10
    parser-125k                                          71.43      25.00       2.86
    radix2-big-64k                                     2320.19     623.44       3.72
    sha-test                                            555.56     227.27       2.44
    zip-test                                            363.64     166.67       2.18
    
    MARK RESULTS TABLE
    
    Mark Name                                        MultiCore SingleCore    Scaling
    ----------------------------------------------- ---------- ---------- ----------
    CoreMark-PRO                                      18743.79    6306.76       2.97
    
  • the linking of a C++ program can be parallelized to a certain degree: Can gcc use multiple cores when linking?

  • rigid body dynamics, sometimes used in game engines, can be parallelized and sped up on the GPU, e.g. MuJoCo and PhysX both support it

How to find out if you are CPU or IO bound

Non-RAM IO bound like disk, network: ps aux, then check if CPU% / 100 < n threads. If yes, you are IO bound, e.g. blocking reads are just waiting for data and the scheduler is skipping that process. Then use further tools like sudo iotop to decide which IO is the problem exactly.
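
The same heuristic can be scripted, for example with the third-party psutil package (a sketch; the PID is a placeholder, and the threshold is as crude as the ps aux version):

import psutil  # third-party: pip install psutil

pid = 12345  # placeholder: PID of the process you want to inspect
p = psutil.Process(pid)
cpu_fraction = p.cpu_percent(interval=1.0) / 100  # sampled over one second
n_threads = p.num_threads()
if cpu_fraction < n_threads:
    print("likely IO bound: threads are mostly waiting")
else:
    print("likely CPU bound: threads are mostly on-CPU")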

Or, if execution is quick, and you parametrize the number of threads, you can see it easily from time that performance improves as the number of threads increases for CPU bound work: What do 'real', 'user' and 'sys' mean in the output of time(1)?

RAM-IO bound: harder to tell, as RAM wait time is included in CPU% measurements. See also:

Some options:

GPUs

GPUs have an IO bottleneck when you first transfer the input data from the regular CPU readable RAM to the GPU.

Therefore, GPUs can only be better than CPUs for CPU bound applications.

Once the data is transferred to the GPU however, it can operate on those bytes faster than the CPU can, because the GPU:

  • has more data localization than most CPU systems, and so data can be accessed faster for some cores than others

  • exploits data parallelism and sacrifices latency by just skipping over any data that is not ready to be operated on immediately.

    Since the GPU has to operate on large parallel input data, it is better to just skip to the next data that might be available, instead of waiting for the current data to become available and blocking all other operations as the CPU mostly does.

Therefore the GPU can be faster than a CPU if your application:

  • can be highly parallelized: different chunks of data can be treated separately from one another at the same time
  • requires a large enough number of operations per input byte (unlike e.g. vector addition, which does only one addition per byte)
  • has a large number of input bytes

These design choices originally targeted the application of 3D rendering, whose main steps are as shown at What are shaders in OpenGL and what do we need them for?

  • vertex shader: multiplying a bunch of 1x4 vectors by a 4x4 matrix
  • fragment shader: calculate the color of each pixel of a triangle based on its relative position within the triangle

and so we conclude that those applications are CPU-bound.

With the advent of programmable GPGPU, we can observe several GPGPU applications that serve as examples of CPU bound operations:

See also:

CPython Global Interpreter Lock (GIL)

As a quick case study, I want to point out the Python Global Interpreter Lock (GIL): What is the global interpreter lock (GIL) in CPython?

This CPython implementation detail prevents multiple Python threads from efficiently using CPU-bound work. The CPython docs say:

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

Therefore, here we have an example where CPU-bound work is not a good fit for Python threads, while I/O-bound work is.
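
A minimal sketch of this difference (pure-Python arithmetic; the worker and chunk counts are arbitrary):

import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def crunch(n):
    # CPU-bound pure-Python loop; the GIL is held while it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, label):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as pool:
        list(pool.map(crunch, [5_000_000] * 4))
    print(label, round(time.perf_counter() - start, 2), "s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads:  ")   # little or no speedup under the GIL
    timed(ProcessPoolExecutor, "processes:")  # roughly 4x with four or more cores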

JavaScript async and Node.js worker_threads

The situation is similar to Python's.

JavaScript is fundamentally single-threaded. I'm not sure if this is part of the language standard or not (for Python it is not; there isn't even a language standard besides the reference CPython implementation, AFAIK).

JavaScript has the async keyword, which allows execution to stall and then start executing something else. You can write stuff like:

async function myfunc(init) {
  let ret = init
  for (let i = 0; i < 1000000; i++) {
    ret += i*i + 2*i + 3
  }
  return ret;
}

async function doNetworkRequest(init) {
  // Some native method that gets network data.
}

const [first, second, myfunc_ret1, myfunc_ret2] = await Promise.all([
  doNetworkRequest('example.com'),
  doNetworkRequest('example2.com'),
  myfunc(1),
  myfunc(2),
])

where await says "wait for all these async things to finish before moving on".

However, only one of the async methods can run at a time on your CPU, so the CPU intensive work of myfunc does not get sped up by this at all.

The prototypically IO-bound operation of network access, however, can be sped up, as this fires both network requests one after the other and just waits for both to return while the servers/network do the work.

The fact that there is an actual keyword in the language dedicated for this, async, is telling: unsurprisingly, network requests are very important in the browser dominated context of JavaScript.
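
The Python analogue, as a sketch with asyncio (asyncio.sleep stands in for a real network call; the hostnames are placeholders):

import asyncio
import time

async def fake_request(host):
    # Waiting, not computing: the event loop is free to run other tasks.
    await asyncio.sleep(1.0)
    return f"data from {host}"

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_request("example.com"),
        fake_request("example2.com"),
    )
    print(results, round(time.perf_counter() - start, 2), "s")  # ~1 s, not ~2 s

asyncio.run(main())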

With the advent of Node.js, however, people increasingly wanted to parallelize CPU workloads as well, and they reached a solution similar to CPython's: create separate processes rather than threads. This is done with the worker_threads library, which:

  • actually takes a JavaScript script path as input, and launches a new interpreter instance to run it
  • since the processes don't share memory space, you pass messages between instances with a messaging system

https://medium.com/@Trott/using-worker-threads-in-node-js-80494136dbb6 contains a good example.

The documentation of worker_threads once more states the difference as mentioned elsewhere in this answer:

Workers (threads) are useful for performing CPU-intensive JavaScript operations. They do not help much with I/O-intensive work. The Node.js built-in asynchronous I/O operations are more efficient than Workers can be.

In the browser there are also Web Workers; I'm not sure how they compare to Node's implementation: What's the difference between Web Workers and Worker Threads?

南薇 2024-07-26 06:06:25

CPU bound means the program is bottlenecked by the CPU, or central processing unit, while I/O bound means the program is bottlenecked by I/O, or input/output, such as reading or writing to disk, network, etc.

In general, when optimizing computer programs, one tries to seek out the bottleneck and eliminate it. Knowing that your program is CPU bound helps, so that one doesn't unnecessarily optimize something else.

[And by "bottleneck", I mean the thing that makes your program go slower than it otherwise would have.]

趴在窗边数星星i 2024-07-26 06:06:25

Another way to phrase the same idea:

  • If speeding up the CPU doesn't speed up your program, it may be I/O bound.

  • If speeding up the I/O (e.g. using a faster disk) doesn't help, your program may be CPU bound.

(I used "may be" because you need to take other resources into account. Memory is one example.)
