推力：sort_by_key 和 zip_iterator 性能

发布于 2024-11-02 17:02:58 字数 1643 浏览 6 评论 0原文

问题

我正在使用 sort_by_key 并使用 zip_iterator 传递值。这个 sort_by_key 被调用了很多次，并且在一定的迭代之后，它变得慢十倍！造成性能下降的原因是什么？

症状

我正在使用 sort_by_key 对 3 个向量进行排序，其中一个充当关键向量：

struct Segment
{
  int v[2];
};

thrust::device_vector<int> keyVec;
thrust::device_vector<int> valVec;
thrust::device_vector<Segment> segVec;

// ... code which fills these vectors ...

thrust::sort_by_key( keyVec.begin(), keyVec.end(),
                     make_zip_iterator( make_tuple( valVec.begin(), segVec.begin() ) ) );

向量的大小通常约为 400 万。在最初的 2 次调用中，sort_by_key 需要 0.04 秒，在循环 3 中需要 0.1 秒，然后在其余循环中进一步降低到 0.3 秒。因此，我们看到性能下降了 10 倍。

额外信息

为了确保唯一的降级因素是sort_by_key，我将上面的替换为使用手写内核的手动排序：

thrust::device_vector<int> indexVec( keyVec.size() );
thrust::sequence( indexVec.begin(), indexVec.end() );

// Sort the keys and indexes
thrust::sort_by_key( keyVec.begin(), keyVec.end(), indexVec.begin() );

thrust::device_vector<int> valVec2( keyVec.size() );
thrust::device_vector<Segment> segVec2( keyVec.size() );

// Use index array and move vectors to destination
moveKernel<<< x, y >>>(
  toRawPtr( indexVec ),
  indexVec.size(),
  toRawPtr( valVec ),
  toRawPtr( segVec ),
  toRawPtr( valVec2 ),
  toRawPtr( segVec2 ) );

// Swap back into original vectors
valVec.swap( valVec2 );
segVec.swap( segVec2 );

这种手写排序需要0.03s并且这个性能是一致的在所有迭代中，与 sort_by_key 和 zip_iterator 看到的性能下降不同。

原文

Problem

I am using sort_by_key with the values being passed using a zip_iterator. This sort_by_key is called many times, and after a certain iteration it becomes 10x slower! What is the cause of this drop in performance?

Symptom

I am sorting 3 vectors using sort_by_key, one of them acts as the key vector:

struct Segment
{
  int v[2];
};

thrust::device_vector<int> keyVec;
thrust::device_vector<int> valVec;
thrust::device_vector<Segment> segVec;

// ... code which fills these vectors ...

thrust::sort_by_key( keyVec.begin(), keyVec.end(),
                     make_zip_iterator( make_tuple( valVec.begin(), segVec.begin() ) ) );

The size of the vector is usually about 4 million. In the initial 2 times it is called, the sort_by_key takes 0.04s, in loop 3 it takes 0.1s and then degrades further to 0.3s for the rest of the loops. Thus, we see a 10x degradation in performance.

Extra Information

To ensure that the only factor of degradation was sort_by_key, I replaced the above with manual sorting using a handwritten kernel:

thrust::device_vector<int> indexVec( keyVec.size() );
thrust::sequence( indexVec.begin(), indexVec.end() );

// Sort the keys and indexes
thrust::sort_by_key( keyVec.begin(), keyVec.end(), indexVec.begin() );

thrust::device_vector<int> valVec2( keyVec.size() );
thrust::device_vector<Segment> segVec2( keyVec.size() );

// Use index array and move vectors to destination
moveKernel<<< x, y >>>(
  toRawPtr( indexVec ),
  indexVec.size(),
  toRawPtr( valVec ),
  toRawPtr( segVec ),
  toRawPtr( valVec2 ),
  toRawPtr( segVec2 ) );

// Swap back into original vectors
valVec.swap( valVec2 );
segVec.swap( segVec2 );

This handwritten sort takes 0.03s and this performance is consistent across all iterations, unlike the performance drop seen with sort_by_key and zip_iterator.

分享到QQ

分享到微博