What is "vectorization"?

Posted on 2024-08-04 21:28:24

Several times now, I've encountered this term in MATLAB, Fortran, and some other languages, but I've never found an explanation of what it means or what it does. So I'm asking here: what is vectorization, and what does it mean, for example, that "a loop is vectorized"?

Comments (9)

执着的年纪 2024-08-11 21:28:25

Many CPUs have "vector" or "SIMD" instruction sets which apply the same operation simultaneously to two, four, or more pieces of data. Modern x86 chips have the SSE instructions, many PPC chips have the "Altivec" instructions, and even some ARM chips have a vector instruction set, called NEON.

"Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.

I chose 4 because it's what modern hardware is most likely to directly support for 32-bit floats or ints.


The difference between vectorization and loop unrolling:
Consider the following very simple loop that adds the elements of two arrays and stores the results to a third array.

for (int i=0; i<16; ++i)
    C[i] = A[i] + B[i];

Unrolling this loop would transform it into something like this:

for (int i=0; i<16; i+=4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}

Vectorizing it, on the other hand, produces something like this:

for (int i=0; i<16; i+=4)
    addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);

Where "addFourThingsAtOnceAndStoreResult" is a placeholder for whatever intrinsic(s) your compiler uses to specify vector instructions.
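To make the placeholder concrete, here is a minimal sketch using the x86 SSE intrinsics from `<xmmintrin.h>`. The function name `add_arrays` and the scalar fallback for non-SSE targets are my own illustration, not part of the answer above, and the sketch assumes the length is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Hypothetical helper: c[i] = a[i] + b[i] for i in [0, n).
   Assumes n is a multiple of 4 so no leftover elements need handling. */
void add_arrays(float *c, const float *a, const float *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
#if defined(__SSE__)
        __m128 va = _mm_loadu_ps(&a[i]);           /* load 4 floats, no alignment required */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));  /* 4 additions with one vector add */
#else
        /* Fallback when SSE is not available: the unrolled scalar form. */
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
#endif
    }
}
```

The unaligned load/store variants are used here so the sketch works for any float array; aligned variants exist but require 16-byte aligned pointers.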


Terminology:

Note that most modern ahead-of-time compilers are able to auto-vectorize very simple loops like this, which can often be enabled via a compile option (on by default with full optimization in modern C and C++ compilers, like gcc -O3 -march=native). OpenMP #pragma omp simd is sometimes helpful as a hint to the compiler, especially for "reduction" loops like summing an FP array, where vectorization requires pretending that FP math is associative.
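To see why a reduction needs that permission, here is a rough sketch of the reordering a vectorizer effectively performs when summing floats: four independent partial sums (one per SIMD lane) that are combined at the end. The function name `sum_four_lanes` is only an illustration. Because the additions happen in a different order than in a plain left-to-right loop, the rounding can differ, which is why a compiler will not do this on its own unless it is told that reassociation is acceptable.

```c
#include <stddef.h>

/* Hypothetical sketch: sums n floats using 4 partial sums, mimicking the
   order of operations of a 4-lane SIMD reduction. On the plain loop
   "for (...) sum += x[i];" one would instead write
   "#pragma omp simd reduction(+:sum)" and let the compiler do this. */
float sum_four_lanes(const float *x, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};  /* one running sum per lane */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        lane[0] += x[i];
        lane[1] += x[i + 1];
        lane[2] += x[i + 2];
        lane[3] += x[i + 3];
    }

    /* Horizontal step: combine the lanes into one scalar. */
    float sum = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    /* Leftover elements when n is not a multiple of 4. */
    for (; i < n; ++i)
        sum += x[i];

    return sum;
}
```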

More complex algorithms still require help from the programmer to generate good vector code; we call this manual vectorization, often with intrinsics like x86 _mm_add_ps that map to a single machine instruction, as in "SIMD prefix sum on Intel cpu" or "How to count character occurrences using SIMD". SIMD can even be used for short non-looping problems, as in "Most insanely fastest way to convert 9 char digits into an int or unsigned int" or "How to convert a binary integer number to a hex string?"

The term "vectorization" is also used to describe a higher-level software transformation where you abstract away the loop altogether and describe operating on arrays instead of on the elements that comprise them, e.g. writing C = A + B in a language that allows that when those are arrays or matrices, unlike C or C++. In lower-level languages like those, you could describe calling BLAS or Eigen library functions, instead of manually writing loops, as a vectorized programming style. Some other answers on this question focus on that meaning of vectorization and on higher-level languages.

盛夏尉蓝 2024-08-11 21:28:25

Vectorization is the term for converting a scalar program to a vector program. Vectorized programs can run multiple operations from a single instruction, whereas scalar programs can only operate on one pair of operands at a time.

From Wikipedia:

Scalar approach:

for (i = 0; i < 1024; i++)
{
   C[i] = A[i]*B[i];
}

Vectorized approach:

for (i = 0; i < 1024; i+=4)
{
   C[i:i+3] = A[i:i+3]*B[i:i+3];
}
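
The `C[i:i+3]` slice notation above is pseudocode, not valid C. As a rough sketch of how it might look in real code, the following uses the GCC/Clang `vector_size` extension, where `*` on a vector type multiplies element by element. The names `float4` and `multiply_arrays` are my own, and on other compilers the sketch simply falls back to the scalar loop.

```c
#include <stddef.h>
#include <string.h>

#if defined(__GNUC__)
/* GCC/Clang extension: a 16-byte vector holding 4 packed floats. */
typedef float float4 __attribute__((vector_size(16)));
#endif

/* Hypothetical helper: c[i] = a[i] * b[i] for i in [0, n), any n. */
void multiply_arrays(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;

#if defined(__GNUC__)
    for (; i + 4 <= n; i += 4) {
        float4 va, vb, vc;
        memcpy(&va, &a[i], sizeof va);   /* load 4 elements */
        memcpy(&vb, &b[i], sizeof vb);
        vc = va * vb;                    /* C[i:i+3] = A[i:i+3] * B[i:i+3] */
        memcpy(&c[i], &vc, sizeof vc);   /* store 4 results */
    }
#endif

    /* Scalar loop: leftover elements, or all of them on other compilers. */
    for (; i < n; ++i)
        c[i] = a[i] * b[i];
}
```

The trailing scalar loop matters in practice: the pseudocode assumes the element count is a multiple of 4, which real data often is not.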

坏尐絯℡ 2024-08-11 21:28:25

Vectorization is used heavily in scientific computing, where huge chunks of data need to be processed efficiently.

In real programming applications, I know it's used in NumPy (not sure about others).

NumPy (a package for scientific computing in Python) uses vectorization for speedy manipulation of n-dimensional arrays, which is generally slower if done with Python's built-in options for handling arrays.

Although tons of explanations are out there, here is how vectorization is defined on the NumPy documentation page:

Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:

  1. vectorized code is more concise and easier to read

  2. fewer lines of code generally means fewer bugs

  3. the code more closely resembles standard mathematical notation
    (making it easier, typically, to correctly code mathematical
    constructs)

  4. vectorization results in more “Pythonic” code. Without
    vectorization, our code would be littered with inefficient and
    difficult to read for loops.

染年凉城似染瑾 2024-08-11 21:28:25

It refers to the ability to do a single mathematical operation on a list, or "vector", of numbers in a single step. You see it often with Fortran because that's associated with scientific computing, which is associated with supercomputing, where vectorized arithmetic first appeared. Nowadays almost all desktop CPUs offer some form of vectorized arithmetic, through technologies like Intel's SSE. GPUs also offer a form of vectorized arithmetic.

眼眸里的快感 2024-08-11 21:28:25

Vectorization, in simple words, means optimizing the algorithm so that it can utilize the SIMD instructions in the processor.

AVX, AVX2 and AVX-512 are instruction sets (introduced by Intel) that perform the same operation on multiple data in one instruction. For example, AVX-512 means you can operate on 16 integer values (4 bytes each) at a time. That means that if you have a vector of 16 integers and you want to double the value of each integer and then add 10 to it, you can either load the values into general-purpose registers one at a time and perform the same operation 16 times, or you can load all 16 values into a SIMD register (xmm, ymm or, for the full 512 bits, zmm) and perform the operation once. This speeds up the computation on vector data.
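
As a small sketch of that example in C: the loop below is written in plain scalar form, which is the kind of loop compilers can usually auto-vectorize. Whether it really becomes a single 512-bit operation depends on the compiler, the flags, and the target CPU, so treat that as an assumption rather than a guarantee. The function name `double_plus_ten` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: out[i] = in[i] * 2 + 10 for i in [0, n).
   "restrict" promises the arrays do not overlap, which removes one
   common obstacle to auto-vectorization. */
void double_plus_ten(int32_t *restrict out, const int32_t *restrict in, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * 2 + 10;
}
```

With n = 16 and 32-bit integers, the whole input is 512 bits, so it fits exactly in one AVX-512 register.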

In vectorization we use this to our advantage by remodelling our data so that we can perform SIMD operations on it and speed up the program.

The only problem with vectorization is handling conditions, because conditions branch the flow of execution. This can be handled by masking, that is, by modelling the condition as an arithmetic operation. For example, if we want to add 10 to a value if it is greater than 100, we can either write

if(x[i] > 100) x[i] += 10; // this will branch execution flow.

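or we can model the condition as an arithmetic operation by creating a condition vector c:

c[i] = -(x[i] > 100);      // mask: all ones (-1) where the condition holds, else 0
x[i] = x[i] + (c[i] & 10); // adds 10 only where the mask is set

(In C a comparison yields 0 or 1, and 1 & 10 is 0, so the result is negated to get an all-ones mask, which is what SSE/AVX2 compare instructions produce.) This is a very trivial example, though. Here c is our masking vector, which we use to perform a bitwise operation based on its value. This avoids branching the execution flow and enables vectorization.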
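
Here is the masking idea as a small runnable sketch in scalar C. The function name `add_ten_if_above_100` is made up for illustration; the point is only that the loop body contains no branch, so every element goes through the same instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: adds 10 to every element greater than 100,
   without an if statement in the loop body. */
void add_ten_if_above_100(int32_t *x, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* The comparison yields 1 or 0; negating gives -1 (all bits set) or 0. */
        int32_t mask = -(int32_t)(x[i] > 100);
        x[i] += mask & 10;  /* 10 where the mask is set, 0 elsewhere */
    }
}
```

For an input of {50, 100, 101, 200}, only the last two elements exceed 100, so only they are increased.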

Vectorization is as important as parallelization, so we should make use of it as much as possible. All modern processors have SIMD instructions for heavy compute workloads. We can optimize our code to use these SIMD instructions through vectorization, which is similar to parallelizing our code to run on the multiple cores available on modern processors.

I would like to close by mentioning OpenMP, which lets you vectorize code using pragmas. I consider it a good starting point. The same can be said for OpenACC.

初懵 2024-08-11 21:28:25

I think the explanation by the Intel people is easy to grasp:

Vectorization is the process of converting an algorithm from operating on a single value at a time to operating on a set of values at one time. Modern CPUs provide direct support for vector operations where a single instruction is applied to multiple data (SIMD).

For example, a CPU with a 512 bit register could hold 16 32-bit single precision doubles and do a single calculation.

16 times faster than executing a single instruction at a time. Combine this with threading and multi-core CPUs leads to orders of magnitude performance gains.
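
The 16 in that quote is just the register width divided by the element width. A tiny sketch of the arithmetic follows; the function name `simd_lanes` is hypothetical. Note that an actual double is 64 bits, so only 8 of them fit in a 512-bit register, and the 16x figure is best read as an upper bound on arithmetic throughput rather than a measured speedup.

```c
#include <stddef.h>

/* Hypothetical helper: how many elements of element_bits each fit
   in a SIMD register that is register_bits wide. */
size_t simd_lanes(size_t register_bits, size_t element_bits)
{
    return register_bits / element_bits;
}
```

For example, `simd_lanes(512, 32)` covers the 32-bit float case from the quote, `simd_lanes(512, 64)` the double case, and `simd_lanes(128, 32)` the SSE case used in the first answer.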

Link https://software.intel.com/en-us/articles/vectorization-a-key-tool-to-improve-performance-on-modern-cpus

In Java, there is an option for this to be included in JDK 15 (2020) or, later, in JDK 16 (2021). See this official issue.

感悟人生的甜 2024-08-11 21:28:25

Hope you are well!

Vectorization refers to all the techniques that convert a scalar implementation, in which a single operation processes a single entity at a time, to a vector implementation, in which a single operation processes multiple entities at the same time.

Vectorization is a technique with which we optimize code to work with huge chunks of data efficiently. It is applied in scientific libraries like NumPy and pandas, and you can also use this technique when working with MATLAB, image processing, NLP, and much more. Overall, it optimizes the runtime and memory allocation of the program.

Hope you get your answer!

Thank you.

聚集的泪 2024-08-11 21:28:25

I would define vectorisation as a feature of a given language where the responsibility for how to iterate over the elements of a collection can be delegated from the programmer (e.g. an explicit loop over the elements) to some method provided by the language (e.g. an implicit loop).

Now, why would we ever want to do that?

  1. Code readability. For some (but not all!) cases, operating over the entire collection at once rather than on its elements is easier to read and quicker to code;
  2. Some interpreted languages (R, Python, MATLAB... but not Julia, for example) are really slow at processing explicit loops. In these cases vectorisation uses compiled instructions under the hood for this element-by-element processing, and can be several orders of magnitude faster than processing each programmer-specified loop operation;
  3. Most modern CPUs (and, nowadays, GPUs) have built-in parallelization that is exploitable when we use the vectorisation method provided by the language rather than our self-implemented order of operations on the elements;
  4. In a similar way, our programming language of choice will likely use software libraries (e.g. BLAS/LAPACK) for some vectorised operations (e.g. matrix operations); these libraries exploit the multi-threading capabilities of the CPU, another form of parallel computation.

Note that for points 3 and 4 some languages (Julia notably) allow these hardware parallelizations to be exploited also with programmer-defined order processing (e.g. for loops), but this happens automatically and under the hood when using the vectorisation method provided by the language.

Now, while vectorisation has many advantages, sometimes an algorithm is more intuitively expressed using an explicit loop than using vectorisation (where perhaps we need to resort to complex linear algebra operations, identity and diagonal matrices... all to retain our "vectorised" approach), and if the explicit loop has no computational disadvantage, it should be preferred.

命比纸薄 2024-08-11 21:28:25

See the two answers above. I just wanted to add that the reason for wanting to do vectorization is that these operations can easily be performed in parallel by supercomputers and multi-processors, yielding a big performance gain. On single-processor computers there will be no performance gain.
