What do the terms "CPU bound" and "I/O bound" mean?

What do the terms "CPU bound" and "I/O bound" mean?

Comments (11)

離人涙 2024-07-26 06:06:26

When your program is waiting for I/O (i.e. a disk read/write, network read/write, etc.), the CPU is free to do other tasks even though your program is stopped. The speed of your program will mostly depend on how fast that I/O can happen; if you want to speed it up, you will need to speed up the I/O.

If your program is running lots of program instructions and not waiting for I/O, then it is said to be CPU bound. Speeding up the CPU will make the program run faster.

In either case, the key to speeding up the program might not be to speed up the hardware, but to optimize the program to reduce the amount of I/O or CPU it needs, or to have it do I/O while it also does CPU-intensive work.
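
As a rough illustration of the two cases, here is a minimal Python sketch (time.sleep stands in for real I/O such as a disk or network read; the loop size and delay are arbitrary):

import time

def cpu_bound(n=10_000_000):
    # Spends its time executing instructions: a faster CPU finishes sooner.
    total = 0
    for i in range(n):
        total += i * i
    return total

def io_bound(delay=1.0):
    # Spends its time waiting: a faster CPU would not help at all.
    time.sleep(delay)

for task in (cpu_bound, io_bound):
    start = time.perf_counter()
    task()
    print(f"{task.__name__}: {time.perf_counter() - start:.2f}s")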

旧情别恋 2024-07-26 06:06:26

  • I/O-bound processes: spend more time doing I/O than computation; many short CPU bursts.
  • CPU-bound processes: spend more time doing computation; few, very long CPU bursts.

紫﹏色ふ单纯 2024-07-26 06:06:26

See what Microsoft says.

The core of async programming is the Task and Task<T> objects, which
model asynchronous operations. They are supported by the async and
await keywords. The model is fairly simple in most cases:

  • For I/O-bound code, you await an operation which returns a Task or
    Task<T> inside of an async method.

  • For CPU-bound code, you await an operation which is started on a
    background thread with the Task.Run method.

The await keyword is where the magic happens. It yields control to the
caller of the method that performed await, and it ultimately allows a
UI to be responsive or a service to be elastic.

I/O-Bound Example: Downloading data from a web service

private readonly HttpClient _httpClient = new HttpClient();

downloadButton.Clicked += async (o, e) =>
{
    // This line will yield control to the UI as the request
    // from the web service is happening.
    //
    // The UI thread is now free to perform other work.
    var stringData = await _httpClient.GetStringAsync(URL);
    DoSomethingWithData(stringData);
};

CPU-bound Example: Performing a Calculation for a Game

private DamageResult CalculateDamageDone()
{
    // Code omitted:
    //
    // Does an expensive calculation and returns
    // the result of that calculation.
}

calculateButton.Clicked += async (o, e) =>
{
    // This line will yield control to the UI while CalculateDamageDone()
    // performs its work.  The UI thread is free to perform other work.
    var damageResult = await Task.Run(() => CalculateDamageDone());
    DisplayDamage(damageResult);
};

The examples above showed how you can use async and await for I/O-bound and CPU-bound work. It is key that you can identify when a job you need to do is I/O-bound or CPU-bound, because it can greatly affect the performance of your code and could potentially lead to misusing certain constructs.

Here are two questions you should ask before you write any code:

Will your code be "waiting" for something, such as data from a
database?

  • If your answer is "yes", then your work is I/O-bound.

Will your code be performing a very expensive computation?

  • If you answered "yes", then your work is CPU-bound.

If the work you have is I/O-bound, use async and await without
Task.Run. You should not use the Task Parallel Library. The reason for
this is outlined in the Async in Depth article.

If the work you have is CPU-bound and you care about responsiveness,
use async and await but spawn the work off on another thread with
Task.Run. If the work is appropriate for concurrency and parallelism,
you should also consider using the Task Parallel Library.

成熟稳重的好男人 2024-07-26 06:06:26

I/O bound refers to a condition in which the time it takes to complete a computation is determined principally by the period spent waiting for input/output operations to be completed.

This is the opposite of a task being CPU bound. This circumstance arises when the rate at which data is requested is slower than the rate it is consumed or, in other words, more time is spent requesting data than processing it.

不羁少年 2024-07-26 06:06:26

An application is CPU-bound when the arithmetic/logical/floating-point (A/L/FP) performance during the execution is mostly near the theoretical peak performance of the processor (data provided by the manufacturer and determined by the characteristics of the processor: number of cores, frequency, registers, ALUs, FPUs, etc.).

Peak performance is very difficult, if not impossible, to achieve in real-world applications. Most applications access memory in different parts of the execution, and the processor does no A/L/FP operations for several cycles at a time. This is called the von Neumann limitation, due to the distance that exists between the memory and the processor.

If you want to be near the CPU peak performance, a strategy could be to reuse most of the data in the cache memory in order to avoid requiring data from main memory. An algorithm that exploits this feature is matrix-matrix multiplication (if both matrices can be stored in the cache memory). This happens because if the matrices are of size n x n, then you need to do about 2 n^3 operations using only 2 n^2 FP numbers of data. On the other hand, matrix addition, for example, is a less CPU-bound (or more memory-bound) application than matrix multiplication, since it requires only n^2 FLOPs with the same data.

The following figure shows the FLOPs obtained with naive algorithms for matrix addition and matrix multiplication on an Intel i5-9300H:

FLOPs comparison between Matrix Addition and Matrix Multiplication

Note that, as expected, the performance of matrix multiplication is greater than that of matrix addition. These results can be reproduced by running test/gemm and test/matadd available in this repository.

I also suggest watching the video given by J. Dongarra about this effect.

瑶笙 2024-07-26 06:06:26

I/O-bound process: if most of a process's lifetime is spent in the I/O state, then it is an I/O-bound process. Examples: Calculator, Internet Explorer.

CPU-bound process: if most of a process's lifetime is spent in the CPU, then it is a CPU-bound process.

许仙没带伞 2024-07-26 06:06:25

It's pretty intuitive:

A program is CPU bound if it would go faster if the CPU were faster, i.e. it spends the majority of its time simply using the CPU (doing calculations). A program that computes new digits of π will typically be CPU-bound; it's just crunching numbers.

A program is I/O bound if it would go faster if the I/O subsystem were faster. Which exact I/O system is meant can vary; I typically associate it with the disk, but networking or communication in general is common too. A program that looks through a huge file for some data might become I/O bound, since the bottleneck is then the reading of the data from disk (actually, this example is perhaps kind of old-fashioned these days, with hundreds of MB/s coming in from SSDs).

無心 2024-07-26 06:06:25

CPU Bound means the rate at which process progresses is limited by the speed of the CPU. A task that performs calculations on a small set of numbers, for example multiplying small matrices, is likely to be CPU bound.

I/O Bound means the rate at which a process progresses is limited by the speed of the I/O subsystem. A task that processes data from disk, for example, counting the number of lines in a file is likely to be I/O bound.

Memory bound means the rate at which a process progresses is limited by the amount of memory available and the speed of that memory access. A task that processes large amounts of in-memory data, for example multiplying large matrices, is likely to be memory bound.

Cache bound means the rate at which a process progresses is limited by the amount and speed of the cache available. A task that simply processes more data than fits in the cache will be cache bound.

I/O bound would be slower than memory bound, which would be slower than cache bound, which would be slower than CPU bound.

The solution to being I/O bound isn't necessarily to get more memory. In some situations, the access algorithm could be designed around the I/O, memory, or cache limitations. See Cache-Oblivious Algorithms.
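
As a sketch of the first two categories in Python ("big.log" is a placeholder name for any large file; NumPy is assumed for the matrix part):

import time
import numpy as np

# I/O bound: counting lines in a large file; throughput is limited by the
# disk, not the CPU. "big.log" is a placeholder file name.
start = time.perf_counter()
with open("big.log", "rb") as f:
    lines = sum(chunk.count(b"\n")
                for chunk in iter(lambda: f.read(1 << 20), b""))
print("lines:", lines, "in", round(time.perf_counter() - start, 2), "s")

# CPU bound: repeatedly multiplying a small matrix that fits in cache;
# arithmetic dominates, so a faster CPU helps.
a = np.random.rand(200, 200)
start = time.perf_counter()
for _ in range(1000):
    _ = a @ a
print("matmul:", round(time.perf_counter() - start, 2), "s")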

苯莒 2024-07-26 06:06:25

Multi-threading is where it tends to matter the most

In this answer, I will investigate one important use case of distinguishing between CPU- and IO-bound work: writing multi-threaded code.

RAM I/O bound example: Vector Sum

Consider a program that sums all the values of a single vector:

#define SIZE 1000000000
unsigned int is[SIZE];
unsigned int sum = 0;
size_t i = 0;
for (i = 0; i < SIZE; i++)
    /* Each one of those requires a RAM access! */
    sum += is[i];

Parallelizing that by splitting the array equally for each of your cores is of limited usefulness on common modern desktops.

For example, on my Ubuntu 19.04, Lenovo ThinkPad P51 laptop with CPU: Intel Core i7-7820HQ CPU (4 cores / 8 threads), RAM: 2x Samsung M471A2K43BB1-CRC (2x 16GiB) I get results like this:

Plot data.

Note, however, that there is a lot of variance between runs. But I can't increase the array size much further, since I'm already at 8 GiB, and I'm not in the mood for statistics across multiple runs today. This did seem like a typical run after doing many manual runs.

Benchmark code:

I don't know enough about computer architecture to fully explain the shape of the curve, but one thing is clear: the computation does not become 8x faster as naively expected from my using all 8 threads! For some reason, 2 and 3 threads were the optimum, and adding more just makes things much slower.

Compare this to CPU bound work, which actually does get 8 times faster: What do 'real', 'user' and 'sys' mean in the output of time(1)?

The reason is that all processors share a single memory bus linking them to RAM:

CPU 1   --\    Bus    +-----+
CPU 2   ---\__________| RAM |
...     ---/          +-----+
CPU N   --/

so the memory bus quickly becomes the bottleneck, not the CPU.

This happens because adding two numbers takes a single CPU cycle, while memory reads take about 100 CPU cycles on 2016 hardware.

So the CPU work done per byte of input data is too small, and we call this an IO-bound process.

The only way to speed up that computation further would be to speed up individual memory accesses with new memory hardware, e.g. multi-channel memory.

Upgrading to a faster CPU clock for example would not be very useful.

Other examples

  • matrix multiplication is CPU-bound on RAM and GPUs. The input contains:

    2 * N**2
    

    numbers, but:

    N ** 3
    

    multiplications are done, and that is enough for parallelization to be worth it for practical large N.

    This is why parallel CPU matrix multiplication libraries exist.

    Cache usage makes a big difference to the speed of implementations. See for example this didactic GPU comparison example.

    See also:

  • Networking is the prototypical IO-bound example.

    Even when we send a single byte of data, it still takes a long time to reach its destination.

    Parallelizing small network requests like HTTP requests can offer huge performance gains.

    If the network is already at full capacity (e.g. downloading a torrent), parallelization can still improve latency (e.g. you can load a web page "at the same time").

  • A dummy C++ CPU bound operation that takes one number and crunches it a lot:

  • Sorting appears to be CPU bound, based on the following experiment: Are C++17 Parallel Algorithms implemented already?, which showed a 4x performance improvement for parallel sort, but I would like to have a more theoretical confirmation as well

  • The well-known Coremark benchmark from EEMBC explicitly checks how well a suite of problems scales. A sample benchmark result clearly shows this:

    Workload Name                                     (iter/s)   (iter/s)    Scaling
    ----------------------------------------------- ---------- ---------- ----------
    cjpeg-rose7-preset                                  526.32     178.57       2.95
    core                                                  7.39       2.16       3.42
    linear_alg-mid-100x100-sp                           684.93     238.10       2.88
    loops-all-mid-10k-sp                                 27.65       7.80       3.54
    nnet_test                                            32.79      10.57       3.10
    parser-125k                                          71.43      25.00       2.86
    radix2-big-64k                                     2320.19     623.44       3.72
    sha-test                                            555.56     227.27       2.44
    zip-test                                            363.64     166.67       2.18
    
    MARK RESULTS TABLE
    
    Mark Name                                        MultiCore SingleCore    Scaling
    ----------------------------------------------- ---------- ---------- ----------
    CoreMark-PRO                                      18743.79    6306.76       2.97
    
  • the linking of a C++ program can be parallelized to a certain degree: Can gcc use multiple cores when linking?

  • rigid body dynamics, sometimes used in game engines, can be parallelized and sped up on the GPU, e.g. MuJoCo and PhysX both support it

How to find out if you are CPU or IO bound

Non-RAM IO bound like disk, network: ps aux, then check if CPU% / 100 < n threads. If yes, you are IO bound, e.g. blocking reads are just waiting for data and the scheduler is skipping that process. Then use further tools like sudo iotop to decide which IO is the problem exactly.
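
The same heuristic can be scripted, for example with the third-party psutil package (a sketch; the PID is a placeholder, and the threshold is as crude as the ps aux version):

import psutil  # third-party: pip install psutil

pid = 12345  # placeholder: PID of the process you want to inspect
p = psutil.Process(pid)
cpu_fraction = p.cpu_percent(interval=1.0) / 100  # sampled over one second
n_threads = p.num_threads()
if cpu_fraction < n_threads:
    print("likely IO bound: threads are mostly waiting")
else:
    print("likely CPU bound: threads are mostly on-CPU")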

Or, if execution is quick, and you parametrize the number of threads, you can see it easily from time that performance improves as the number of threads increases for CPU bound work: What do 'real', 'user' and 'sys' mean in the output of time(1)?

RAM-IO bound: harder to tell, as RAM wait time is included in CPU% measurements. See also:

Some options:

GPUs

GPUs have an IO bottleneck when you first transfer the input data from the regular CPU readable RAM to the GPU.

Therefore, GPUs can only be better than CPUs for CPU bound applications.

Once the data is transferred to the GPU however, it can operate on those bytes faster than the CPU can, because the GPU:

  • has more data localization than most CPU systems, and so data can be accessed faster for some cores than others

  • exploits data parallelism and sacrifices latency by just skipping over any data that is not ready to be operated on immediately.

    Since the GPU has to operate on large parallel input data, it is better to just skip to the next data that might be available, instead of waiting for the current data to become available and blocking all other operations as the CPU mostly does.

Therefore the GPU can be faster than a CPU if your application:

  • can be highly parallelized: different chunks of data can be treated separately from one another at the same time
  • requires a large enough number of operations per input byte (unlike e.g. vector addition, which does only one addition per byte)
  • has a large number of input bytes

These design choices originally targeted the application of 3D rendering, whose main steps are as shown at What are shaders in OpenGL and what do we need them for?

  • vertex shader: multiplying a bunch of 1x4 vectors by a 4x4 matrix
  • fragment shader: calculate the color of each pixel of a triangle based on its relative position within the triangle

and so we conclude that those applications are CPU-bound.

With the advent of programmable GPGPU, we can observe several GPGPU applications that serve as examples of CPU bound operations:

See also:

CPython Global Interpreter Lock (GIL)

As a quick case study, I want to point out the Python Global Interpreter Lock (GIL): What is the global interpreter lock (GIL) in CPython?

This CPython implementation detail prevents multiple Python threads from efficiently using CPU-bound work. The CPython docs say:

CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing or concurrent.futures.ProcessPoolExecutor. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

Therefore, here we have an example where CPU-bound work is not a good fit for Python threads, while I/O-bound work is.
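
A minimal sketch of this difference (pure-Python arithmetic; the worker and chunk counts are arbitrary):

import time
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def crunch(n):
    # CPU-bound pure-Python loop; the GIL is held while it runs.
    total = 0
    for i in range(n):
        total += i * i
    return total

def timed(executor_cls, label):
    start = time.perf_counter()
    with executor_cls(max_workers=4) as pool:
        list(pool.map(crunch, [5_000_000] * 4))
    print(label, round(time.perf_counter() - start, 2), "s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads:  ")   # little or no speedup under the GIL
    timed(ProcessPoolExecutor, "processes:")  # roughly 4x with four or more cores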

JavaScript async and Node.js worker_threads

The situation is similar to Python's.

JavaScript is fundamentally single-threaded. I'm not sure if this is part of the language standard or not (for Python it is not; there isn't even a language standard besides the reference CPython implementation, AFAIK).

JavaScript has the async keyword, which allows execution to stall and then start executing something else. You can write stuff like:

async function myfunc(init) {
  let ret = init
  for (let i = 0; i < 1000000; i++) {
    ret += i*i + 2*i + 3
  }
  return ret;
}

async function doNetworkRequest(init) {
  // Some native method that gets network data.
}

const [first, second, myfunc_ret1, myfunc_ret2] = await Promise.all([
  doNetworkRequest('example.com'),
  doNetworkRequest('example2.com'),
  myfunc(1),
  myfunc(2),
])

where await says "wait for all these async things to finish before moving on".

However, only one of the async methods can run at a time on your CPU, so the CPU intensive work of myfunc does not get sped up by this at all.

The prototypically IO-bound operation of network access, however, can be sped up, as this fires both network requests one after the other and just waits for both to return while the servers/network do the work.

The fact that there is an actual keyword in the language dedicated for this, async, is telling: unsurprisingly, network requests are very important in the browser dominated context of JavaScript.
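
The Python analogue, as a sketch with asyncio (asyncio.sleep stands in for a real network call; the hostnames are placeholders):

import asyncio
import time

async def fake_request(host):
    # Waiting, not computing: the event loop is free to run other tasks.
    await asyncio.sleep(1.0)
    return f"data from {host}"

async def main():
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_request("example.com"),
        fake_request("example2.com"),
    )
    print(results, round(time.perf_counter() - start, 2), "s")  # ~1 s, not ~2 s

asyncio.run(main())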

With the advent of Node.js, however, people increasingly wanted to parallelize CPU workloads as well, and they reached a solution similar to CPython's: create separate processes rather than threads. This is done with the worker_threads library, which:

  • actually takes a JavaScript script path as input, and launches a new interpreter instance to run it
  • since the processes don't share memory space, you pass messages between instances with a messaging system

https://medium.com/@Trott/using-worker-threads-in-node-js-80494136dbb6 contains a good example.

The documentation of worker_threads once more states the difference as mentioned elsewhere in this answer:

Workers (threads) are useful for performing CPU-intensive JavaScript operations. They do not help much with I/O-intensive work. The Node.js built-in asynchronous I/O operations are more efficient than Workers can be.

In the browser there are also Web Workers; I'm not sure how they compare to Node's implementation: What's the difference between Web Workers and Worker Threads?

南薇 2024-07-26 06:06:25

CPU bound means the program is bottlenecked by the CPU, or central processing unit, while I/O bound means the program is bottlenecked by I/O, or input/output, such as reading or writing to disk, network, etc.

In general, when optimizing computer programs, one tries to seek out the bottleneck and eliminate it. Knowing that your program is CPU bound helps, so that one doesn't unnecessarily optimize something else.

[And by "bottleneck", I mean the thing that makes your program go slower than it otherwise would have.]

趴在窗边数星星i 2024-07-26 06:06:25

Another way to phrase the same idea:

  • If speeding up the CPU doesn't speed up your program, it may be I/O bound.

  • If speeding up the I/O (e.g. using a faster disk) doesn't help, your program may be CPU bound.

(I used "may be" because you need to take other resources into account. Memory is one example.)
