CUB device scan fails with custom scan op
I am using CUB::InclusiveScan, which takes a custom binary, non-commutative operator. I have defined my operator as:
template <typename T>
struct MultAddFunctor
{
    const T factor;
    MultAddFunctor(T factor) : factor(factor) {}

    // Scan operator: combine two values as factor*a + b.
    __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return factor*a + b;
    }
};
Otherwise, my code is nearly identical to the example code in the documentation (except that I have freed the allocated memory and added extra synchronization to rule that out as the problem).
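Roughly, it follows the documentation's two-phase pattern, something like this (a condensed sketch rather than my exact code; the array names, size, and error handling are illustrative):

#include <cub/cub.cuh>
#include <cstdio>

// MultAddFunctor<T> as defined above.

int main()
{
    const int num_items = 32;
    float h_in[num_items];
    for (int i = 0; i < num_items; ++i) h_in[i] = 1.0f;   // all-1.0 input, as in the table below

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in,  num_items * sizeof(float));
    cudaMalloc(&d_out, num_items * sizeof(float));
    cudaMemcpy(d_in, h_in, num_items * sizeof(float), cudaMemcpyHostToDevice);

    MultAddFunctor<float> scan_op(0.8f);

    // Phase 1: query how much temporary storage the scan needs.
    void  *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes,
                                   d_in, d_out, scan_op, num_items);

    // Phase 2: allocate the temporary storage, then run the actual scan.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes,
                                   d_in, d_out, scan_op, num_items);
    cudaDeviceSynchronize();

    float h_out[num_items];
    cudaMemcpy(h_out, d_out, num_items * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 16; ++i) printf("%2d %.3f\n", i, h_out[i]);

    cudaFree(d_temp_storage);
    cudaFree(d_out);
    cudaFree(d_in);
    return 0;
}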
When factor is 1.0, this produces the correct results (which is just a prefix sum). When factor is something else (such as 0.8), the results are correct for the first 12 values but diverge considerably after that. For example, if the array being scanned is just a bunch of 1.0s, I get the following results:
i   CUDA    Serial
0 1.000 1.000 ✅
1 1.800 1.800 ✅
2 2.440 2.440 ✅
3 2.952 2.952 ✅
4 3.362 3.362 ✅
5 3.689 3.689 ✅
6 3.951 3.951 ✅
7 4.161 4.161 ✅
8 4.329 4.329 ✅
9 4.463 4.463 ✅
10 4.571 4.571 ✅
11 4.656 4.656 ✅
12 6.313 4.725 ❌
13 6.050 4.780 ❌
14 5.840 4.824 ❌
15 5.672 4.859 ❌
...
At element 12, there is a sudden jump in the values and then a decrease even though this should just keep getting larger at a consistent rate.
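For reference, the Serial column is produced by applying the operator sequentially on the host, along these lines (a sketch with illustrative names):

// Sequential reference scan: out[0] = in[0]; out[i] = factor*out[i-1] + in[i].
void serial_scan(const float *in, float *out, int n, float factor)
{
    out[0] = in[0];
    for (int i = 1; i < n; ++i)
        out[i] = factor * out[i - 1] + in[i];
}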
At first, I thought it was due to the non-commutativity of the operation, but the docs explicitly state that that is fine. I also thought that the factor field itself might not be getting to the device correctly, but even if I hard-code 0.8 into the equation it is still incorrect (although factor is probably always in global memory, so in the future moving factor into shared/local memory would be better).
What other reason could there be for the scan computing incorrect results?
1 Answer
The reason for the failure here is that the cub parallel scan, like other parallel scan implementations I am aware of, requires a binary scan operator that is associative.
This follows directly from the definition of associativity, using addition as an example. The associative property of addition says that

(a + b) + c = a + (b + c)

When applied here, it means that in a parallel setting, I can apply the binary op to either (a+b) first, or (b+c) first, and then apply the binary op to the remaining step, and I should get the same result. This is associativity, not commutativity. The op here is not associative. To demonstrate this, we actually perform the math indicated by the operator in each case (imagining/pretending that factor is 1000) and we get:

(a op b) op c = factor*(factor*a + b) + c = 1000000*a + 1000*b + c
a op (b op c) = factor*a + (factor*b + c) =    1000*a + 1000*b + c

and the equality does not hold.
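The same check can be run as a few lines of host code (a small sketch; op mirrors the functor from the question, with factor fixed at 1000):

#include <cstdio>

// Host copy of the question's combine rule: op(a, b) = factor*a + b.
static double op(double factor, double a, double b) { return factor * a + b; }

int main()
{
    const double f = 1000.0;              // exaggerated factor, as above
    const double a = 1.0, b = 1.0, c = 1.0;
    double left  = op(f, op(f, a, b), c); // (a op b) op c  -> 1001001
    double right = op(f, a, op(f, b, c)); //  a op (b op c) ->    2001
    printf("left = %.0f  right = %.0f\n", left, right);
    return 0;
}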