推力:sort_by_key 和 zip_iterator 性能
问题
我正在使用 sort_by_key
并使用 zip_iterator
传递值。这个 sort_by_key
被调用了很多次,并且在一定的迭代之后,它变得慢十倍!造成性能下降的原因是什么?
症状
我正在使用 sort_by_key
对 3 个向量进行排序,其中一个充当关键向量:
struct Segment
{
int v[2];
};
thrust::device_vector<int> keyVec;
thrust::device_vector<int> valVec;
thrust::device_vector<Segment> segVec;
// ... code which fills these vectors ...
thrust::sort_by_key( keyVec.begin(), keyVec.end(),
make_zip_iterator( make_tuple( valVec.begin(), segVec.begin() ) ) );
向量的大小通常约为 400 万。在最初的 2 次调用中,sort_by_key
需要 0.04 秒,在循环 3 中需要 0.1 秒,然后在其余循环中进一步降低到 0.3 秒。因此,我们看到性能下降了 10 倍。
额外信息
为了确保唯一的降级因素是sort_by_key
,我将上面的替换为使用手写内核的手动排序:
thrust::device_vector<int> indexVec( keyVec.size() );
thrust::sequence( indexVec.begin(), indexVec.end() );
// Sort the keys and indexes
thrust::sort_by_key( keyVec.begin(), keyVec.end(), indexVec.begin() );
thrust::device_vector<int> valVec2( keyVec.size() );
thrust::device_vector<Segment> segVec2( keyVec.size() );
// Use index array and move vectors to destination
moveKernel<<< x, y >>>(
toRawPtr( indexVec ),
indexVec.size(),
toRawPtr( valVec ),
toRawPtr( segVec ),
toRawPtr( valVec2 ),
toRawPtr( segVec2 ) );
// Swap back into original vectors
valVec.swap( valVec2 );
segVec.swap( segVec2 );
这种手写排序需要0.03s并且这个性能是一致的在所有迭代中,与 sort_by_key 和 zip_iterator 看到的性能下降不同。
Problem
I am using sort_by_key
with the values being passed using a zip_iterator
. This sort_by_key
is called many times, and after a certain iteration it becomes 10x slower! What is the cause of this drop in performance?
Symptom
I am sorting 3 vectors using sort_by_key
, one of them acts as the key vector:
struct Segment
{
int v[2];
};
thrust::device_vector<int> keyVec;
thrust::device_vector<int> valVec;
thrust::device_vector<Segment> segVec;
// ... code which fills these vectors ...
thrust::sort_by_key( keyVec.begin(), keyVec.end(),
make_zip_iterator( make_tuple( valVec.begin(), segVec.begin() ) ) );
The size of the vector is usually about 4 million. In the initial 2 times it is called, the sort_by_key
takes 0.04s, in loop 3 it takes 0.1s and then degrades further to 0.3s for the rest of the loops. Thus, we see a 10x degradation in performance.
Extra Information
To ensure that the only factor of degradation was sort_by_key
, I replaced the above with manual sorting using a handwritten kernel:
thrust::device_vector<int> indexVec( keyVec.size() );
thrust::sequence( indexVec.begin(), indexVec.end() );
// Sort the keys and indexes
thrust::sort_by_key( keyVec.begin(), keyVec.end(), indexVec.begin() );
thrust::device_vector<int> valVec2( keyVec.size() );
thrust::device_vector<Segment> segVec2( keyVec.size() );
// Use index array and move vectors to destination
moveKernel<<< x, y >>>(
toRawPtr( indexVec ),
indexVec.size(),
toRawPtr( valVec ),
toRawPtr( segVec ),
toRawPtr( valVec2 ),
toRawPtr( segVec2 ) );
// Swap back into original vectors
valVec.swap( valVec2 );
segVec.swap( segVec2 );
This handwritten sort takes 0.03s and this performance is consistent across all iterations, unlike the performance drop seen with sort_by_key and zip_iterator.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
为了准确地计时每个循环,您需要在每个循环结束时使用 cudaThreadSynchronize。您获得的前两个循环的计时可能不是您正在寻找的实际计时。
For accurate timing across each loop, you need to use cudaThreadSynchronize at the end of each loop. The timings you are getting for the first two loops may not be the actual timing you are looking for.