使用非零值初始化双精度数组 (BLAS)

发布于 2024-10-21 03:15:56 字数 113 浏览 4 评论 0原文

我分配了一个大的双向量,假设有 100000 个元素。在我的代码中的某个时刻,我想将所有元素设置为常量、非零值。如何在不对所有元素使用 for 循环的情况下做到这一点? 我也在使用 BLAS 包,如果有帮助的话。

I have allocated a big double vector, lets say with 100000 element. At some point in my code, I want to set all elements to a constant, nonzero value. How can I do this without using a for loop over all elements?
I am also using the BLAS package, if it helps.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

城歌 2024-10-28 03:15:56

您可以使用 std::fill (#include):

std::fill(v.begin(), v.end(), 1);

当然,这本质上也只是一个循环。

You could use std::fill (#include <algorithm>):

std::fill(v.begin(), v.end(), 1);

This is essentially also only a loop of course..

明明#如月 2024-10-28 03:15:56

你所说的“填充”是正确的。

请注意,也可以构造一个充满指定值的向量:

std::vector<double> vec(100000, 3.14);

因此,如果“在某个点”意味着“构造后立即”,请改为这样做。另外,这意味着您可以这样做:

std::vector<double>(100000, 3.14).swap(vec);

如果“在某个时刻”意味着“更改大小后立即”,并且您期望/希望重新分配向量(“期望”,如果您使其大于其先前的容量,如果您将其做得更小并希望对其进行修剪以节省内存,则“想要”)。

'fill' is right from what you've said.

Be aware that it's also possible to construct a vector full of a specified value:

std::vector<double> vec(100000, 3.14);

So if "at some point" means "immediately after construction", do this instead. Also, it means you can do this:

std::vector<double>(100000, 3.14).swap(vec);

which might be useful if "at some point" means "immediately after changing the size", and you expect/want the vector to be reallocated ("expect" if you're making it bigger than its prior capacity, "want" if you're making it much smaller and want it trimmed to save memory).

囚你心 2024-10-28 03:15:56

不幸的是,其他答案没有遵循提示,操作员想要将数组的元素设置为零。
使用 BLAS 而不是更惯用的函数(例如 memsetfill)可能有多种原因。
例如,BLAS 操作可以是线程化的。
此外,memset 和 fill 不提供开箱即用的跨步操作。

乍一看,BLAS 库似乎没有提供此类功能,但是,有两种选择:

  1. 可以利用 BLAS 函数 xSCALx 可以对于不同的数字类型,为 sdcz)。

SCAL 执行缩放操作V <- a*V
对于a = 0,它将元素设置为零(大部分)。

  1. 使用 xCOPY,并从堆栈内存中复制一个零。

完整代码如下。

两种方法都有自己的问题,第一种策略依赖于任何浮点数 x 的任何 x*0.0 == 0.0
从技术上讲,这对于 x == NAN 或 x == infinity 来说是不正确的(两种情况都有说明)。
也许 BLAS 可以以非 IEEE 兼容的方式编译,这实际上给出了这一点。
无论如何,如果由于某种原因您知道原始值是常规数字,那么您可以使用它。
另一个问题是,您可以将带符号的零(-0.0)作为元素,这是可以的,只是它们最终可能不都是相同的统一(例如正零,0.0)代码>)。

第二个更强大,但依赖于 BLAS 接受零步幅值。
(BLAS 是在 70 年代发明并用 Fortran 编写的,当时还没有发明整数零。)
我知道的大多数实现都允许零增量,至少对于 xCOPY 来说是这样。
它还需要从某个地方取出“第一个”零;在这个例子中刚刚在堆栈中创建。
(如果您要推广到 GPU BLAS (cuBLAS),那么您将需要在 GPU 中分配这个零。)

因此,换句话说,您必须了解您的平台和可用的 BLAS。

#include<cstdint>
#include<iostream>
#include<limits>

extern "C" {
void sscal_(int32_t const& n, float const& a, float* x, int32_t const& incx);
void scopy_(int32_t const& n, float const* x, int32_t const& incx, float* y, int32_t const& incy);
}

void set_zero_1(int32_t n, float* x, int incx) {
  sscal_(n, 0.0F, x, incx);
}

void fill_value(int32_t n, float* x, int incx, float const* value_ptr) {
  scopy_(n, value_ptr, 0, x, incx);
}

void set_zero_2(int32_t n, float* x, int incx) {
  float const value = 0.0F;  // can also be allocated or be a global if necessary
  fill_value(n, x, incx, &value);
}

int main() {
  float X[12] = {
    99.9, 0.0, 0.0, 
    std::numeric_limits<float>::quiet_NaN(), 0.0, 0.0, 
    std::numeric_limits<float>::infinity(), 0.0, 0.0, 
    99.9, 0.0, 0.0
  };
//set_zero_1(                 4,            &X[0],            3);  // this fails because NAN and INF
  set_zero_2(/*num elements*/ 4, /*origin*/ &X[0], /*stride*/ 3);
  for(int i = 0; i != 12; i += 3) std::cout << X[i] << std::endl;  // prints zeros
}

像这样使用。它将打印零。

$ c++ a.cpp -L/opt/intel/oneapi/mkl/2023.0.0/lib/intel64 -lmkl_rt 
$ export LD_LIBRARY_PATH=/opt/intel/oneapi/mkl/2023.0.0/lib/intel64
$ ./a.out
0
0
0
0

另外,还有这种方式,但它依赖于 BLAS 的顺序,并且它不能并行化或矢量化任何操作来使其工作,所以它可能是最糟糕的选择。

void set_zero_3(int32_t n, float* x, int incx) {
  *x = 0.0F;  // set one element to zero, somehow. (see note above about memory that is not accessible from the CPU)
  scopy_(n - 1, x, incx, x + incx, incx);
}

It is unfortunate the other answers didn't follow the prompt, the OP wanted to set the elements of an array to zero.
There could be several reason to use BLAS instead of more idiomatic functions (such as memset or fill).
For example BLAS operations could be threaded.
Also memset and fill do not offer out-of-the-box strided operations.

At first glance, it seems that the BLAS library doesn't offer such functionality, however, there are two options:

  1. One can exploit the BLAS function xSCAL for that (x can be s,d,c,z for different numeric types).

SCAL does the scale operation V <- a*V.
For a = 0 it sets the elements to zero (mostly).

  1. Use xCOPY, and copy a single zero from stack memory.

Full code below.

Both approaches have their own problems, the first strategy relies on any x*0.0 == 0.0 for any floating point number x.
Which is technically not true for x == NAN or x == infinity (both cases illustrated).
Maybe BLAS could be compiled in a non-IEEE compliant manner that in effect gives this.
In any case, if for some reason you know that the original values are regular numbers then you can use this.
Another issue is that you could get signed zeros (-0.0) as elements, which are ok, except that they may end up not all being the same uniform (e.g. positive zero, 0.0).

The second is more robust but relies on BLAS accepting zero-stride values.
(BLAS was invented and coded in Fortran in the 70's, integer zero was not invented at the time.)
Most implementations I know would allow zero increments, at least, for xCOPY.
It also needs to pull out the "first" zero from somewhere; in this example is just created in the stack.
(If you are generalizing to GPU BLAS (cuBLAS) then you will need to allocate this zero in the GPU.)

So, in other words, you have to know your platform and the BLAS you have available.

#include<cstdint>
#include<iostream>
#include<limits>

extern "C" {
void sscal_(int32_t const& n, float const& a, float* x, int32_t const& incx);
void scopy_(int32_t const& n, float const* x, int32_t const& incx, float* y, int32_t const& incy);
}

void set_zero_1(int32_t n, float* x, int incx) {
  sscal_(n, 0.0F, x, incx);
}

void fill_value(int32_t n, float* x, int incx, float const* value_ptr) {
  scopy_(n, value_ptr, 0, x, incx);
}

void set_zero_2(int32_t n, float* x, int incx) {
  float const value = 0.0F;  // can also be allocated or be a global if necessary
  fill_value(n, x, incx, &value);
}

int main() {
  float X[12] = {
    99.9, 0.0, 0.0, 
    std::numeric_limits<float>::quiet_NaN(), 0.0, 0.0, 
    std::numeric_limits<float>::infinity(), 0.0, 0.0, 
    99.9, 0.0, 0.0
  };
//set_zero_1(                 4,            &X[0],            3);  // this fails because NAN and INF
  set_zero_2(/*num elements*/ 4, /*origin*/ &X[0], /*stride*/ 3);
  for(int i = 0; i != 12; i += 3) std::cout << X[i] << std::endl;  // prints zeros
}

use like this. It will print zeros.

$ c++ a.cpp -L/opt/intel/oneapi/mkl/2023.0.0/lib/intel64 -lmkl_rt 
$ export LD_LIBRARY_PATH=/opt/intel/oneapi/mkl/2023.0.0/lib/intel64
$ ./a.out
0
0
0
0

Bonus, there is also this way, but it relies on BLAS being sequential, and it can't parallelize or vectorized any operation for this to work, so it is probably the worst option.

void set_zero_3(int32_t n, float* x, int incx) {
  *x = 0.0F;  // set one element to zero, somehow. (see note above about memory that is not accessible from the CPU)
  scopy_(n - 1, x, incx, x + incx, incx);
}
将军与妓 2024-10-28 03:15:56

如果您不想循环,则始终使用 memset()

也就是说,memset(myarr, 5, arrsize); 以便用全 5 填充它。请注意隐式转换为 unsigned char。

概要

 #include ;

 空白 *
 memset(void *b, int c, size_t len);

描述

 memset() 函数写入值 c 的 len 个字节(转换为
 unsigned char) 到字节字符串 b.

如果向量很大,并且您需要它快速运行并且您正在使用 gcc,那么:

块移动的代码生成(memcpy)
并重写了块集(memset)。
GCC 现在可以选择最佳算法
(循环、展开循环、指令
代表前缀或库调用)基于
正在复制的块的大小和
正在优化的CPU。

You always use memset() if you don't want to loop.

That is, memset(myarr, 5, arrsize); in order to fill it with all 5's. Beware of implicit conversion to unsigned char.

SYNOPSIS

 #include <string.h>

 void *
 memset(void *b, int c, size_t len);

DESCRIPTION

 The memset() function writes len bytes of value c (converted to an
 unsigned char) to the byte string b.

And if the vector is large, and you need it to go fast and you are using gcc, then :

Code generation of block move (memcpy)
and block set (memset) was rewritten.
GCC can now pick the best algorithm
(loop, unrolled loop, instruction with
rep prefix or a library call) based on
the size of the block being copied and
the CPU being optimized for.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文