CUB device scan fails with custom scan op
I am using CUB::InclusiveScan, which takes a custom binary, non-commutative operator. I have defined my operator as:
template <typename T>
struct MultAddFunctor
{
    const T factor;
    MultAddFunctor(T factor) : factor(factor) {}

    // Scan operator: combine two values as factor*a + b.
    __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return factor*a + b;
    }
};
Otherwise, my code is nearly identical to the example code in the documentation (except that I have freed the allocated memory and added extra synchronization to rule that out as the problem).
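Roughly, it follows the documentation's two-phase pattern, something like this (a condensed sketch rather than my exact code; the array names, size, and error handling are illustrative):

#include <cub/cub.cuh>
#include <cstdio>

// MultAddFunctor<T> as defined above.

int main()
{
    const int num_items = 32;
    float h_in[num_items];
    for (int i = 0; i < num_items; ++i) h_in[i] = 1.0f;   // all-1.0 input, as in the table below

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in,  num_items * sizeof(float));
    cudaMalloc(&d_out, num_items * sizeof(float));
    cudaMemcpy(d_in, h_in, num_items * sizeof(float), cudaMemcpyHostToDevice);

    MultAddFunctor<float> scan_op(0.8f);

    // Phase 1: query how much temporary storage the scan needs.
    void  *d_temp_storage = NULL;
    size_t temp_storage_bytes = 0;
    cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes,
                                   d_in, d_out, scan_op, num_items);

    // Phase 2: allocate the temporary storage, then run the actual scan.
    cudaMalloc(&d_temp_storage, temp_storage_bytes);
    cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes,
                                   d_in, d_out, scan_op, num_items);
    cudaDeviceSynchronize();

    float h_out[num_items];
    cudaMemcpy(h_out, d_out, num_items * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 16; ++i) printf("%2d %.3f\n", i, h_out[i]);

    cudaFree(d_temp_storage);
    cudaFree(d_out);
    cudaFree(d_in);
    return 0;
}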
When factor is 1.0, this produces the correct results (which is just a prefix sum). When factor is something else (such as 0.8), the results are correct for the first 12 values but diverge considerably after that. For example, if the array being scanned is just a bunch of 1.0s, I get the following results:
i   CUDA    Serial
0 1.000 1.000 ✅
1 1.800 1.800 ✅
2 2.440 2.440 ✅
3 2.952 2.952 ✅
4 3.362 3.362 ✅
5 3.689 3.689 ✅
6 3.951 3.951 ✅
7 4.161 4.161 ✅
8 4.329 4.329 ✅
9 4.463 4.463 ✅
10 4.571 4.571 ✅
11 4.656 4.656 ✅
12 6.313 4.725 ❌
13 6.050 4.780 ❌
14 5.840 4.824 ❌
15 5.672 4.859 ❌
...
At element 12, there is a sudden jump in the values and then a decrease even though this should just keep getting larger at a consistent rate.
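For reference, the Serial column is produced by applying the operator sequentially on the host, along these lines (a sketch with illustrative names):

// Sequential reference scan: out[0] = in[0]; out[i] = factor*out[i-1] + in[i].
void serial_scan(const float *in, float *out, int n, float factor)
{
    out[0] = in[0];
    for (int i = 1; i < n; ++i)
        out[i] = factor * out[i - 1] + in[i];
}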
At first, I thought it was due to the non-commutativity of the operation, but the docs explicitly state that that is fine. I also thought that the factor field itself might not be getting to the device correctly, but even if I hard-code 0.8 into the equation it is still incorrect (although factor is probably always in global memory, so in the future moving factor into shared/local memory would be better).
What other reason could there be for the scan computing incorrect results?
1 Answer
The reason for the failure here is that the cub parallel scan, like other parallel scan implementations I am aware of, requires a binary scan operator that is associative.
This follows directly from the definition of associativity, using addition as an example. The associative property of addition says that

(a + b) + c = a + (b + c)

When applied here, it means that in a parallel setting, I can apply the binary op to either (a+b) first, or (b+c) first, and then apply the binary op to the remaining step, and I should get the same result. This is associativity, not commutativity. The op here is not associative. To demonstrate this, we actually perform the math indicated by the operator in each case (imagining/pretending that factor is 1000) and we get:

(a op b) op c = factor*(factor*a + b) + c = 1000000*a + 1000*b + c
a op (b op c) = factor*a + (factor*b + c) =    1000*a + 1000*b + c

and the equality does not hold.
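The same check can be run as a few lines of host code (a small sketch; op mirrors the functor from the question, with factor fixed at 1000):

#include <cstdio>

// Host copy of the question's combine rule: op(a, b) = factor*a + b.
static double op(double factor, double a, double b) { return factor * a + b; }

int main()
{
    const double f = 1000.0;              // exaggerated factor, as above
    const double a = 1.0, b = 1.0, c = 1.0;
    double left  = op(f, op(f, a, b), c); // (a op b) op c  -> 1001001
    double right = op(f, a, op(f, b, c)); //  a op (b op c) ->    2001
    printf("left = %.0f  right = %.0f\n", left, right);
    return 0;
}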