Allocating shared memory

Posted on 2024-10-28 21:45:23

I am trying to allocate shared memory using a constant parameter, but I am getting an error. My kernel looks like this:

__global__ void Kernel(const int count)
{
    __shared__ int a[count];
}

and I am getting an error saying

error: expression must have a constant value

count is const! Why am I getting this error? And how can I get around this?

Comments (5)

倚栏听风 2024-11-04 21:45:23

CUDA supports dynamic shared memory allocation. If you define the kernel like this:

__global__ void Kernel(const int count)
{
    extern __shared__ int a[];
}

and then pass the number of bytes required as the third argument of the kernel launch

Kernel<<< gridDim, blockDim, a_size >>>(count)

then it can be sized at run time. Be aware that the runtime supports only a single dynamically sized allocation per block. If you need more than one array, you will need to use pointers to offsets within that single allocation. Also be aware when using such pointers that each offset must be correctly aligned for the type accessed through it (for example, 4 bytes for int and 8 bytes for double), irrespective of the type used to declare the shared memory allocation.
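A minimal end-to-end sketch of the approach described above may help. The kernel name reverseInBlock and the host helper runReverse are made up for illustration, and the sketch assumes a single block with at least count threads (so count must not exceed the device's block-size limit).

```cuda
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel (not from the original post): reverses `count` ints
// by staging them in a dynamically sized shared array.
// Assumes a single block with at least `count` threads.
__global__ void reverseInBlock(int *data, const int count)
{
    extern __shared__ int a[];   // unsized; the launch supplies the byte count

    const int t = threadIdx.x;
    if (t < count) a[t] = data[t];
    __syncthreads();             // every element must be staged before it is read
    if (t < count) data[t] = a[count - 1 - t];
}

// Hypothetical host helper: returns true if the launch succeeded and the
// array came back reversed.
bool runReverse(int count)
{
    std::vector<int> h(count);
    for (int i = 0; i < count; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc(&d, count * sizeof(int)) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), count * sizeof(int), cudaMemcpyHostToDevice);

    const size_t a_size = count * sizeof(int);   // third launch argument is in bytes
    reverseInBlock<<<1, count, a_size>>>(d, count);
    bool ok = (cudaGetLastError() == cudaSuccess) &&
              (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h.data(), d, count * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    for (int i = 0; i < count && ok; ++i) ok = (h[i] == count - 1 - i);
    return ok;
}
```

Calling runReverse(256) from main would launch one block of 256 threads with 256 * sizeof(int) bytes of dynamic shared memory. Passing the element count instead of the byte count as the third launch argument is a common mistake and would under-allocate the array.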

累赘 2024-11-04 21:45:23

const doesn't mean "constant", it means "read-only".

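A constant expression is something whose value is known to the compiler at compile time. A kernel parameter such as count only receives its value when the kernel is launched, so it cannot be used to size a statically declared array.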
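If the size really is known when the code is compiled, it can be supplied as a genuine constant expression. The sketch below shows two ways of doing that, a constexpr variable and a template parameter. The names kCount, fillConstexpr, fillTemplate and runStaticSizes are invented for this example, and it assumes a compiler with C++11 or later enabled.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kCount = 64;   // a true compile-time constant (hypothetical size)

// Static shared array sized by a constexpr variable.
__global__ void fillConstexpr(int *out)
{
    __shared__ int a[kCount];            // OK: kCount is a constant expression
    const int t = threadIdx.x;
    if (t < kCount) { a[t] = 2 * t; out[t] = a[t]; }
}

// Static shared array sized by a template parameter; each distinct COUNT
// produces a separately compiled kernel.
template <int COUNT>
__global__ void fillTemplate(int *out)
{
    __shared__ int a[COUNT];             // OK: COUNT is fixed at instantiation
    const int t = threadIdx.x;
    if (t < COUNT) { a[t] = 3 * t; out[t] = a[t]; }
}

// Hypothetical host helper: runs both kernels and checks their output.
bool runStaticSizes()
{
    const size_t bytes = kCount * sizeof(int);
    int *d = nullptr;
    if (cudaMalloc(&d, bytes) != cudaSuccess) return false;
    std::vector<int> h(kCount);

    fillConstexpr<<<1, kCount>>>(d);
    bool ok = cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; i < kCount && ok; ++i) ok = (h[i] == 2 * i);

    fillTemplate<kCount><<<1, kCount>>>(d);
    ok = ok && cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    for (int i = 0; i < kCount && ok; ++i) ok = (h[i] == 3 * i);

    cudaFree(d);
    return ok;
}
```

The template form suits a small set of sizes chosen by the host at compile time. For sizes only known at run time, dynamic shared memory remains the option.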

泛滥成性 2024-11-04 21:45:23

Option one: declare the shared memory with a compile-time constant size (not the same as const):

__global__ void Kernel(int count_a, int count_b)
{
    __shared__ int a[100];
    __shared__ int b[4];
}

Option two: declare the shared memory as an unsized extern array and set its size in the kernel launch configuration:

__global__ void Kernel(int count_a, int count_b)
{
    extern __shared__ int shared[];  // one unsized array; its size comes from the launch
    int *a = &shared[0];             // a starts at the beginning of shared
    int *b = &shared[count_a];       // b starts right after the end of a
}

size_t sharedMemory = count_a * sizeof(int) + count_b * sizeof(int);  // size in bytes
Kernel<<<numBlocks, threadsPerBlock, sharedMemory>>>(count_a, count_b);

Note: all dynamically sized (extern) shared arrays declared in a kernel are given the same starting address, so declaring two of them does not give you two separate arrays. I use two arrays here to illustrate how to lay them out manually within the single allocation.
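The same manual layout can hold arrays of different types, as long as each sub-array starts at an offset that is aligned for its element type. The following is a sketch under the assumption of one block of at least n threads. The names mixedArrays and runMixed are hypothetical, and the doubles are placed first so that the ints that follow are automatically 4-byte aligned.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Illustrative kernel: one dynamic allocation holding n doubles followed by n ints.
__global__ void mixedArrays(double *out, int n)
{
    extern __shared__ double smem[];               // base is aligned for double
    double *x   = smem;                            // first n doubles
    int    *idx = reinterpret_cast<int *>(&x[n]);  // ints begin right after the doubles

    const int t = threadIdx.x;
    if (t < n) { x[t] = 0.5 * t; idx[t] = n - 1 - t; }
    __syncthreads();                               // x[] and idx[] fully written
    if (t < n) out[t] = x[idx[t]];                 // reads a value written by another thread
}

// Hypothetical host helper: launches one block and verifies the result.
bool runMixed(int n)
{
    double *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(double)) != cudaSuccess) return false;

    // Total bytes for both sub-arrays, passed as the third launch argument.
    const size_t bytes = n * sizeof(double) + n * sizeof(int);
    mixedArrays<<<1, n, bytes>>>(d, n);
    bool ok = (cudaGetLastError() == cudaSuccess);

    std::vector<double> h(n);
    ok = ok && cudaMemcpy(h.data(), d, n * sizeof(double),
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d);

    for (int i = 0; i < n && ok; ++i) ok = (h[i] == 0.5 * (n - 1 - i));
    return ok;
}
```

Ordering the sub-arrays from the most strictly aligned type to the least is one simple way to avoid padding calculations. If the ints came first and n were odd, the doubles would start at a misaligned address.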

热情消退 2024-11-04 21:45:23

From the "CUDA C Programming Guide":

The execution configuration is specified by inserting an expression of the form:

<<<Dg, Db, Ns, S>>>

where:

  • Dg is of type dim3 and specifies the dimension and size of the grid ...
  • Db is of type dim3 and specifies the dimension and size of each block ...
  • Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory. This dynamically allocated memory is used by any of the variables declared as an external array as mentioned in __shared__; Ns is an optional argument which defaults to 0;
  • S is of type cudaStream_t and specifies the associated stream ...

So by using the Ns parameter, the user can specify at launch time the total amount of dynamically allocated shared memory a kernel function can use, no matter how many shared arrays are carved out of it inside the kernel. Statically declared shared arrays are counted on top of Ns.
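Because Ns is added to whatever the kernel declares statically, it can be useful to check the sum against the device limit before launching. The sketch below is one way to do that with the runtime API. The kernel staticPlusDynamic and the helper fitsInSharedMemory are hypothetical names, device 0 is assumed, and the check uses the default per-block limit (any opt-in for larger allocations on newer devices is not covered here).

```cuda
#include <cuda_runtime.h>

// Illustrative kernel that uses both kinds of shared memory.
__global__ void staticPlusDynamic(int *out, int n)
{
    __shared__ int fixed[32];        // statically allocated: sized at compile time
    extern __shared__ int dyn[];     // dynamically allocated: Ns bytes from the launch

    const int t = threadIdx.x;
    if (t < 32) fixed[t] = t;
    if (t < n)  dyn[t] = 10 * t;
    __syncthreads();
    if (t < n)  out[t] = fixed[t % 32] + dyn[t];
}

// Hypothetical helper: does the kernel's static shared memory plus Ns bytes
// of dynamic shared memory fit in the default per-block limit of device 0?
bool fitsInSharedMemory(size_t Ns)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, staticPlusDynamic) != cudaSuccess) return false;

    int limit = 0;
    if (cudaDeviceGetAttribute(&limit, cudaDevAttrMaxSharedMemoryPerBlock, 0)
            != cudaSuccess) return false;

    // attr.sharedSizeBytes is the statically allocated shared memory per block.
    return attr.sharedSizeBytes + Ns <= static_cast<size_t>(limit);
}
```

A caller might test fitsInSharedMemory(n * sizeof(int)) before launching staticPlusDynamic<<<1, n, n * sizeof(int)>>>(d_out, n), and fall back to a smaller tile size when the check fails.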

泪之魂 2024-11-04 21:45:23

You cannot declare a shared variable like this:

__shared__ int a[count];

although if you are sure about the maximum size of array a, you can declare it directly with a fixed size, like

__shared__ int a[100];

but in this case you should think about how many blocks your program runs: shared memory reserved for a block but not fully used still counts against the per-multiprocessor total, so fewer blocks can be resident at once, which can lead to poor performance.

A nicer solution to this problem is to declare

extern __shared__ int a[];

and specify the amount of memory, in bytes, when launching the kernel, like

Kernel<<< gridDim, blockDim, a_size >>>(count)

but you still need to be careful here: if a block uses more shared memory than the launch allocated, you will get unexpected results.
