What is "vectorization"?

Posted 2024-08-04 21:28:24

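Several times now I have come across this term in MATLAB, Fortran, and a few other places, but I have never found an explanation of what it means or what it does. So I am asking here: what is vectorization, and what does it mean, for example, that "a loop is vectorized"?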

执着的年纪 2024-08-11 21:28:25

Many CPUs have "vector" or "SIMD" instruction sets which apply the same operation simultaneously to two, four, or more pieces of data. Modern x86 chips have the SSE instructions, many PPC chips have the "Altivec" instructions, and even some ARM chips have a vector instruction set, called NEON.

"Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.

I chose 4 because it's what modern hardware is most likely to directly support for 32-bit floats or ints.

The difference between vectorization and loop unrolling:
Consider the following very simple loop that adds the elements of two arrays and stores the results to a third array.

for (int i=0; i<16; ++i)
    C[i] = A[i] + B[i];

Unrolling this loop would transform it into something like this:

for (int i=0; i<16; i+=4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}

Vectorizing it, on the other hand, produces something like this:

for (int i=0; i<16; i+=4)
    addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);

Where "addFourThingsAtOnceAndStoreResult" is a placeholder for whatever intrinsic(s) your compiler uses to specify vector instructions.
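To make the placeholder concrete, here is a minimal sketch using the vector extension that GCC and Clang provide (`__attribute__((vector_size))`). The names `float4`, `add_four_floats`, and `add_arrays` are made up for illustration, and the array length is assumed to be a multiple of 4, as in the 16-element loop above. Other compilers would need their own intrinsics instead.

```c
#include <string.h>

/* Four 32-bit floats packed into one 16-byte vector (GCC/Clang extension). */
typedef float float4 __attribute__((vector_size(16)));

/* Hypothetical stand-in for the placeholder function in the answer. */
static void add_four_floats(float *c, const float *a, const float *b)
{
    float4 va, vb, vc;
    memcpy(&va, a, sizeof va);   /* load a[0..3]; memcpy avoids alignment assumptions */
    memcpy(&vb, b, sizeof vb);   /* load b[0..3] */
    vc = va + vb;                /* one element-wise add across all four lanes */
    memcpy(c, &vc, sizeof vc);   /* store the four sums to c[0..3] */
}

/* Adds n floats, four at a time; n is assumed to be a multiple of 4. */
void add_arrays(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4)
        add_four_floats(&c[i], &a[i], &b[i]);
}
```

On a target with 128-bit SIMD registers, an optimizing compiler would typically turn the body of `add_four_floats` into two vector loads, one vector add, and one vector store.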

Terminology:

Note that most modern ahead-of-time compilers are able to auto-vectorize very simple loops like this one. Auto-vectorization can often be enabled via a compile option (it is on by default with full optimization in modern C and C++ compilers, as in gcc -O3 -march=native). OpenMP's #pragma omp simd is sometimes helpful as a hint to the compiler, especially for "reduction" loops such as summing an FP array, where vectorization requires pretending that FP math is associative.
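As a sketch of the reduction case, the loop below sums a float array and uses the OpenMP `simd` directive with a `reduction` clause to tell the compiler that reordering the additions is acceptable. The function name `sum_array` is made up for illustration. With GCC or Clang the pragma only takes effect when OpenMP support is enabled (for example `-fopenmp` or `-fopenmp-simd`); otherwise it is ignored and the loop stays a correct scalar loop.

```c
#include <stddef.h>

/* Sums n floats. The pragma permits the compiler to keep several partial
 * sums in vector lanes and combine them at the end, which changes the
 * order of the floating-point additions. */
float sum_array(const float *x, size_t n)
{
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}
```

Because the additions may be reordered, the result can differ in the last bits from a strictly sequential sum when the inputs are not exactly representable.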

More complex algorithms still require help from the programmer to generate good vector code. We call this manual vectorization, often done with intrinsics like x86 _mm_add_ps that map to a single machine instruction, as in "SIMD prefix sum on Intel cpu" or "How to count character occurrences using SIMD". SIMD can even be used for short non-looping problems, as in "Most insanely fastest way to convert 9 char digits into an int or unsigned int" or "How to convert a binary integer number to a hex string?"
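For a flavor of manual vectorization, the sketch below adds two float arrays with the SSE intrinsics `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` from `<xmmintrin.h>`. The function name `add_floats` is made up for illustration. The `__SSE__` check assumes a GCC- or Clang-style compiler; where it is not defined, only the scalar loop is compiled, so the function stays correct on other targets.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* c[i] = a[i] + b[i] for i in [0, n). */
void add_floats(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Vector part: four floats per iteration, using unaligned loads/stores. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail: the remaining 0 to 3 elements (or everything without SSE). */
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

The scalar tail is what lets the function accept any length rather than only multiples of 4.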

The term "vectorization" is also used to describe a higher-level software transformation where you abstract away the loop altogether and describe operations on whole arrays instead of on the elements that comprise them. An example is writing C = A + B for arrays or matrices in a language that allows it, unlike C or C++. In lower-level languages like those, calling BLAS or Eigen library functions instead of manually writing loops could be described as a vectorized programming style. Some other answers on this question focus on that meaning of vectorization and on higher-level languages.

盛夏尉蓝 2024-08-11 21:28:25

Vectorization is the term for converting a scalar program to a vector program. Vectorized programs can run multiple operations from a single instruction, whereas scalar programs can only operate on one pair of operands at a time.

From Wikipedia:

Scalar approach:

for (i = 0; i < 1024; i++)
{
   C[i] = A[i]*B[i];
}

Vectorized approach:

for (i = 0; i < 1024; i+=4)
{
   C[i:i+3] = A[i:i+3]*B[i:i+3];
}
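The slice notation `C[i:i+3]` above is pseudocode, not valid C. A rough plain-C equivalent, written so that a compiler has a good chance of mapping each block to vector instructions, might look like the sketch below. The name `multiply_arrays` and the `LANES` constant are made up for illustration, and `restrict` is used on the assumption that the three arrays do not overlap.

```c
#include <stddef.h>

enum { LANES = 4 };   /* block width, matching the i += 4 step above */

/* c[i] = a[i] * b[i] for i in [0, n); the arrays must not overlap. */
void multiply_arrays(int *restrict c, const int *restrict a,
                     const int *restrict b, size_t n)
{
    size_t i = 0;
    /* Main loop: each outer iteration handles the block c[i..i+3]. */
    for (; i + LANES <= n; i += LANES)
        for (size_t lane = 0; lane < LANES; ++lane)
            c[i + lane] = a[i + lane] * b[i + lane];
    /* Tail: leftover elements when n is not a multiple of LANES. */
    for (; i < n; ++i)
        c[i] = a[i] * b[i];
}
```

Whether the inner loop actually becomes a single vector multiply depends on the compiler, the optimization level, and the target.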

坏尐絯℡ 2024-08-11 21:28:25

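Vectorization is used widely in scientific computing, where large amounts of data need to be processed efficiently.

In real programming applications, I know it is used in NumPy (I am not sure about others).

NumPy (a package for scientific computing in Python) uses vectorization for fast manipulation of n-dimensional arrays, which is generally slower when done with Python's built-in options for handling arrays.

Although there are plenty of explanations out there, here is how vectorization is defined on the NumPy documentation page:

Vectorization describes the absence of any explicit looping, indexing, etc., in the code. These things are taking place, of course, just "behind the scenes" in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:

- vectorized code is more concise and easier to read
- fewer lines of code generally means fewer bugs
- the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)
- vectorization results in more "Pythonic" code. Without vectorization, our code would be littered with inefficient and difficult to read for loops.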

染年凉城似染瑾 2024-08-11 21:28:25

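It refers to the ability to perform a single mathematical operation on a list (or "vector") of numbers in a single step. You see it often with Fortran because Fortran is associated with scientific computing, which in turn is associated with supercomputing, where vectorized arithmetic first appeared. Nowadays almost all desktop CPUs offer some form of vectorized arithmetic through technologies such as Intel's SSE. GPUs also offer a form of vectorized arithmetic.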

眼眸里的快感 2024-08-11 21:28:25

Vectorization, in simple words, means optimizing the algorithm so that it can utilize SIMD instructions in the processors.

AVX, AVX2 and AVX-512 are x86 instruction sets (introduced by Intel) that perform the same operation on multiple data elements in one instruction. For example, with AVX-512 you can operate on 16 integer values (4 bytes each) at a time. Suppose you have a vector of 16 integers and you want to double each value and then add 10 to it. You can either load the values into general-purpose registers one at a time and perform the same operation 16 times, or you can load all 16 values into a single SIMD register (zmm for AVX-512; xmm and ymm are the narrower 128-bit and 256-bit registers) and perform the operation once. This speeds up computation on vector data.
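The "double each value and add 10" example can be sketched with a 512-bit vector type from the GCC/Clang vector extension. The names `int32x16` and `double_plus_ten_16` are made up for illustration. On hardware without AVX-512 the compiler splits the work into narrower operations, so the code still runs; it just does not map to a single zmm instruction.

```c
#include <stdint.h>
#include <string.h>

/* Sixteen 32-bit integers in one 64-byte (512-bit) vector (GCC/Clang extension). */
typedef int32_t int32x16 __attribute__((vector_size(64)));

/* Replaces each of the 16 values x[0..15] with 2 * x[i] + 10. */
void double_plus_ten_16(int32_t *x)
{
    const int32x16 ten = {10, 10, 10, 10, 10, 10, 10, 10,
                          10, 10, 10, 10, 10, 10, 10, 10};
    int32x16 v;
    memcpy(&v, x, sizeof v);   /* load all 16 values at once */
    v = v + v + ten;           /* double every lane, then add 10 to every lane */
    memcpy(x, &v, sizeof v);   /* store all 16 results */
}
```

The scalar alternative described above would be a 16-iteration loop doing `x[i] = 2 * x[i] + 10`.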

In vectorization we use this to our advantage, by remodelling our data so that we can perform SIMD operations on it and speed up the program.

The main difficulty with vectorization is handling conditionals, because conditionals branch the flow of execution. This can be handled by masking, that is, by modelling the condition as an arithmetic operation. For example, if we want to add 10 to a value when it is greater than 100, we can either write:

if(x[i] > 100) x[i] += 10; // this will branch execution flow.

or we can model the condition as an arithmetic operation by creating a condition vector c:

c[i] = -(x[i] > 100); // mask: all ones (-1) if the condition holds, otherwise 0
x[i] = x[i] + (c[i] & 10); // adds 10 only where the mask is set

This is a very trivial example, though. Here c is our masking vector, which we use to perform a bitwise operation based on its value. This avoids branching the execution flow and enables vectorization.
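One detail matters when trying the masking idea in scalar C: a comparison yields 1, and `1 & 10` is 0, so the mask has to be all ones (that is, -1) for the AND to pass the 10 through. SIMD compare instructions typically produce all-ones lanes directly; in plain C the comparison result can be negated. A minimal sketch, with the made-up function name `add_ten_if_above_100`:

```c
#include <stddef.h>
#include <stdint.h>

/* Adds 10 to every element greater than 100, without an if statement. */
void add_ten_if_above_100(int32_t *x, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* (x[i] > 100) is 1 or 0; negating gives -1 (all bits set) or 0. */
        int32_t mask = -(int32_t)(x[i] > 100);
        x[i] += mask & 10;   /* adds 10 where the mask is set, 0 elsewhere */
    }
}
```

For instance, the input {50, 100, 101, 200} becomes {50, 100, 111, 210}: only the last two elements exceed 100.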

Vectorization is as important as parallelization, so we should make use of it as much as possible. All modern processors have SIMD instructions for heavy compute workloads. We can optimize our code to use these SIMD instructions through vectorization, much as we parallelize our code to run on the multiple cores available on modern processors.

I would like to close by mentioning OpenMP, which lets you vectorize code using pragmas. I consider it a good starting point. The same can be said for OpenACC.

初懵 2024-08-11 21:28:25

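I think Intel's explanation is easy to grasp:

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD). For example, a CPU with a 512-bit register could hold 16 32-bit single-precision values and do a single calculation, 16 times faster than executing a single instruction at a time. Combining this with threading and multi-core CPUs leads to orders-of-magnitude performance gains.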

Link: https://software.intel.com/en-us/articles/vectorization-a-key-tool-to-improve-performance-on-modern-cpus

In Java, there is an option for this to be included in JDK 15 (2020) or, at the latest, JDK 16 (2021). See the official issue.

感悟人生的甜 2024-08-11 21:28:25

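Hope you are well!

Vectorization refers to all the techniques that convert a scalar implementation, in which a single operation processes a single entity at a time, into a vector implementation, in which a single operation processes multiple entities at the same time.

Vectorization is a technique that lets us optimize code to handle large amounts of data efficiently. It is applied in scientific libraries such as NumPy and pandas, and you can also use it when working with MATLAB, image processing, NLP, and so on. Overall, it can improve a program's running time and memory usage.

Hope this answers your question!

Thanks.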

聚集的泪 2024-08-11 21:28:25

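I would define vectorization as a feature of a given language where the responsibility for how to iterate over the elements of a collection can be delegated from the programmer (e.g. an explicit loop over the elements) to some method provided by the language (e.g. an implicit loop).

Now, why would we ever want to do that?

1. Code readability. In some (but not all!) cases, operating on the entire collection at once rather than on its elements is easier to read and quicker to code.
2. Some interpreted languages (R, Python, MATLAB, but not Julia, for example) are really slow at processing explicit loops. In these cases vectorization uses compiled instructions under the hood for this element-by-element processing, and it can be several orders of magnitude faster than processing each programmer-specified loop operation.
3. Most modern CPUs (and, nowadays, GPUs) have built-in parallelization that can be exploited when we use the vectorization method provided by the language instead of our own self-implemented ordering of the operations on the elements.
4. In a similar way, our programming language of choice will likely use, for some vectorized operations (e.g. matrix operations), software libraries (e.g. BLAS/LAPACK) that exploit the multi-threading capabilities of the CPU, which is another form of parallel computation.

Note that for points 3 and 4, some languages (notably Julia) allow these hardware parallelizations to be exploited with programmer-defined ordered processing (e.g. for loops) as well, whereas this happens automatically and under the hood when using the vectorization method provided by the language.

Now, while vectorization has many advantages, sometimes an algorithm is expressed more intuitively with an explicit loop than with vectorization (where we may need to resort to complex linear algebra operations, identity and diagonal matrices, and so on, all to retain our "vectorized" approach). If the explicit form has no computational disadvantage, it should be preferred.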