thrust::sort takes a long time to compile
I'm trying to compile a block of example code using Thrust in an attempt to help learn some CUDA.
I'm using Visual Studio 2010, and I've gotten other examples to compile. However, this example takes upwards of 10 minutes to compile. By selectively commenting out lines, I figured out that it's the thrust::sort line that takes forever (with that one line commented out, compilation takes about 5 seconds).
I found a post somewhere saying that sort is slow to compile in Thrust and that this was a deliberate decision by the Thrust development team (it's 3x faster at runtime, but takes longer to compile). But that post was from late 2008.
Any idea why this is taking so long?
Also, I'm compiling on a machine with the following specs, so it's not a slow machine:
i7-2600K @ 4.5 GHz
16 GB DDR3 @ 1833 MHz
RAID 0 of 6 Gb/s 1 TB drives
As requested, this is the build string that Visual Studio appears to be invoking:
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" -G0 --keep-dir "Debug\" -maxrregcount=32 --machine 64 --compile -D_NEXUS_DEBUG -g -Xcompiler "/EHsc /nologo /Od /Zi /MTd " -o "Debug\kernel.obj" "C:\Users\Rob\Desktop\VS2010Test\VS2010Test\VS2010Test\kernel.cpp" -clean
Example
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>  // rand

int main(void)
{
    // generate 16M random numbers on the host
    thrust::host_vector<int> h_vec(1 << 24);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;

    // sort data on the device (this is the line that makes compilation slow)
    thrust::sort(d_vec.begin(), d_vec.end());

    // transfer data back to host
    thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

    return 0;
}
Comments (1)
The compiler in CUDA 3.2 was not optimized for compiling long, complex programs like sort in debugging mode (i.e. nvcc -G0). You will find that CUDA 4.0 is much faster in this case. Removing the -G0 option should also cut compilation time significantly.
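To illustrate the last suggestion, here is the build string from the question with only the -G0 flag removed. Treat it as a sketch: it assumes every other option stays exactly as Visual Studio generated it, and the paths are the asker's and will differ on another machine.

```text
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\bin\nvcc.exe" -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v3.2\include" --keep-dir "Debug\" -maxrregcount=32 --machine 64 --compile -D_NEXUS_DEBUG -g -Xcompiler "/EHsc /nologo /Od /Zi /MTd " -o "Debug\kernel.obj" "C:\Users\Rob\Desktop\VS2010Test\VS2010Test\VS2010Test\kernel.cpp"
```

Without -G0 the device code is no longer built for debugging, so stepping through kernels in a GPU debugger would presumably not work for this file; host-side debugging via -g and /Zi is unaffected.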