Myers Diff 适合在 GPU 上运行吗？

发布于 2024-11-14 05:12:40 字数 721 浏览 4 评论 0原文

我有兴趣通过在 GPU（即使用 OpenCL）上运行来实现更快的 Myers diff 实现。我对该算法有很好的了解，但对 GPU 编程还是新手。我的预感是 GPU 的性能不会很好，但我想听听大家的想法和想法。

下面是对 C 算法的一次迭代的描述。我们有两个字节常量缓冲区“左”和“右”（我们正在比较的数据），以及一个称为向量的共享 int32 可变数组。 'idx' 是迭代索引。那么算法本质上是这样的：

   void myers_diff_iteration(const uint8 *left, const uint8 *right, int32 *vector, int32 idx) {
       int32 x = MAX(vector[idx-1], vector[idx+1]);
       int32 y = idx - x;
       while (left[x] == right[y]) {
           x++;
           y++;
       }
       vector[x] = x;
   }

我的猜测是 while 循环（其迭代次数非常不可预测，范围从零到一百万）可能对 GPU 来说非常糟糕，并且消除了任何性能增益。这是真的吗？有关如何改进它的任何提示吗？

此外，该向量在循环的所有迭代之间共享。每次迭代都会写入不同的位置，因此不需要同步（除了要求写入对齐的 4 字节字不会影响相邻字之外）。这样的共享向量会表现良好吗？

感谢您的帮助！

原文

I'm interested in making a faster Myers diff implementation by running it on a GPU, i.e. with OpenCL. I have a good understanding of the algorithm but am new to GPU programming. My hunch is that the GPU will not perform well, but I'd like to hear thoughts and ideas.

Here's a description of one iteration of the algorithm in C. We have two constant buffers of bytes 'left' and 'right' (the data we're comparing), and a shared mutable array of int32 called vector. 'idx' is the iteration index. Then the algorithm is essentially this:

   void myers_diff_iteration(const uint8 *left, const uint8 *right, int32 *vector, int32 idx) {
       int32 x = MAX(vector[idx-1], vector[idx+1]);
       int32 y = idx - x;
       while (left[x] == right[y]) {
           x++;
           y++;
       }
       vector[x] = x;
   }

My guess is that the while loop (which has a very unpredictable number of iterations, ranging from zero to a million) is likely to be very bad on the GPU, and eliminate any performance gain. Is that true? Any hints for how to improve it?

Also, the vector is shared between all iterations of the loop. Each iteration writes to a different location, so there's no synchronization needed (beyond requiring that writes to an aligned 4-byte word don't affect neighboring words). Is such a shared vector going to perform well?

Thanks for any help!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

梦年海沫深 2024-11-21 05:12:40

你可以尝试一下。 GPU 的 while 循环会出现严重问题，但只要有足够的“迭代”（线程）运行，就不应该有任何速度损失。

您可以这样重写：

   void myers_diff_iteration(const uint8 *left, const uint8 *right, int32 *vector, int32 idx) {
       int id = get_global_id(0);
       int32 x = MAX(vector[idx-1], vector[idx+1]) + id;
       int32 y = idx - x + id;
       if (left[x] != right[y]) {
           vector[x] = x;
       }
   }

您只需要运行最大 while 循环线程数。但每次 OpenCL 运行它只会产生 1 个向量结果。

最好尝试一下，然后做一些变化。

You could try. The GPU will have serious problems with the while loop, but as long as there is enough "iterations" (threads) running, shouldn't be any speed lose.

You could rewrite it this way:

   void myers_diff_iteration(const uint8 *left, const uint8 *right, int32 *vector, int32 idx) {
       int id = get_global_id(0);
       int32 x = MAX(vector[idx-1], vector[idx+1]) + id;
       int32 y = idx - x + id;
       if (left[x] != right[y]) {
           vector[x] = x;
       }
   }

You only then need to run the max while loop numer of threads. But it will only produce 1 result of the vector each OpenCL run.

The best it to try, and then do some variations.

回复收藏 0 原文

~没有更多了~