Myers Diff 适合在 GPU 上运行吗?
我有兴趣通过在 GPU(即使用 OpenCL)上运行来实现更快的 Myers diff 实现。我对该算法有很好的了解,但对 GPU 编程还是新手。我的预感是 GPU 的性能不会很好,但我想听听大家的想法和想法。
下面是对 C 算法的一次迭代的描述。我们有两个字节常量缓冲区“左”和“右”(我们正在比较的数据),以及一个称为向量的共享 int32 可变数组。 'idx' 是迭代索引。那么算法本质上是这样的:
void myers_diff_iteration(const uint8 *left, const uint8 *right, int32 *vector, int32 idx) {
int32 x = MAX(vector[idx-1], vector[idx+1]);
int32 y = idx - x;
while (left[x] == right[y]) {
x++;
y++;
}
vector[x] = x;
}
我的猜测是 while 循环(其迭代次数非常不可预测,范围从零到一百万)可能对 GPU 来说非常糟糕,并且消除了任何性能增益。这是真的吗?有关如何改进它的任何提示吗?
此外,该向量在循环的所有迭代之间共享。每次迭代都会写入不同的位置,因此不需要同步(除了要求写入对齐的 4 字节字不会影响相邻字之外)。这样的共享向量会表现良好吗?
感谢您的帮助!
I'm interested in making a faster Myers diff implementation by running it on a GPU, i.e. with OpenCL. I have a good understanding of the algorithm but am new to GPU programming. My hunch is that the GPU will not perform well, but I'd like to hear thoughts and ideas.
Here's a description of one iteration of the algorithm in C. We have two constant buffers of bytes 'left' and 'right' (the data we're comparing), and a shared mutable array of int32 called vector. 'idx' is the iteration index. Then the algorithm is essentially this:
void myers_diff_iteration(const uint8 *left, const uint8 *right, int32 *vector, int32 idx) {
int32 x = MAX(vector[idx-1], vector[idx+1]);
int32 y = idx - x;
while (left[x] == right[y]) {
x++;
y++;
}
vector[x] = x;
}
My guess is that the while loop (which has a very unpredictable number of iterations, ranging from zero to a million) is likely to be very bad on the GPU, and eliminate any performance gain. Is that true? Any hints for how to improve it?
Also, the vector is shared between all iterations of the loop. Each iteration writes to a different location, so there's no synchronization needed (beyond requiring that writes to an aligned 4-byte word don't affect neighboring words). Is such a shared vector going to perform well?
Thanks for any help!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
你可以尝试一下。 GPU 的 while 循环会出现严重问题,但只要有足够的“迭代”(线程)运行,就不应该有任何速度损失。
您可以这样重写:
您只需要运行最大 while 循环线程数。但每次 OpenCL 运行它只会产生 1 个向量结果。
最好尝试一下,然后做一些变化。
You could try. The GPU will have serious problems with the while loop, but as long as there is enough "iterations" (threads) running, shouldn't be any speed lose.
You could rewrite it this way:
You only then need to run the max while loop numer of threads. But it will only produce 1 result of the vector each OpenCL run.
The best it to try, and then do some variations.