clock_gettime() timing problem with CUDA
I wanted to write a CUDA program in which I could see firsthand the benefits that CUDA offers for speeding up applications.
Here is a CUDA program I have written using Thrust (http://code.google.com/p/thrust/).
Briefly, all that the code does is create two integer vectors of length 2^23, one on the host and one on the device, identical to each other, and sort them. It also (attempts to) measure the time for each.
On the host vector I use std::sort. On the device vector I use thrust::sort.
For compilation I used
nvcc sortcompare.cu -lrt
The output of the program at the terminal is
Desktop: ./a.out
Host Time taken is: 19 . 224622882 seconds
Device Time taken is: 19 . 321644143 seconds
Desktop:
The first std::cout statement is produced after 19.224 seconds, as stated. Yet the second std::cout statement (even though it says 19.32 seconds) is produced immediately after the first one. Note that I have used separate timespec structs for the measurements in clock_gettime(), viz. ts_host and ts_device.
I am using CUDA 4.0 and an NVIDIA GTX 570 (compute capability 2.0).
#include <iostream>
#include <vector>
#include <algorithm>
#include <stdlib.h>
// For timings
#include <time.h>
// Necessary Thrust headers
#include <thrust/sort.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

int main(int argc, char *argv[])
{
    int N = 23;
    thrust::host_vector<int> H(1 << N);      // Create a vector of 2^N elements on the host
    thrust::device_vector<int> D(1 << N);    // The same on the device
    thrust::host_vector<int> dummy(1 << N);  // Receives D copied back from the GPU after sorting

    // Set the host_vector elements.
    for (int i = 0; i < H.size(); ++i) {
        H[i] = rand();  // Set each host vector element to a pseudo-random number.
    }

    // Sort the host_vector. Measure time.
    // Reset the clock
    timespec ts_host;
    ts_host.tv_sec = 0;
    ts_host.tv_nsec = 0;
    clock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts_host);  // Start clock
    thrust::sort(H.begin(), H.end());
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_host);  // Stop clock
    std::cout << "\nHost Time taken is: " << ts_host.tv_sec << " . " << ts_host.tv_nsec << " seconds" << std::endl;

    D = H;  // Set the device vector elements equal to the host_vector

    // Sort the device vector. Measure time.
    timespec ts_device;
    ts_device.tv_sec = 0;
    ts_device.tv_nsec = 0;
    clock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts_device);  // Start clock
    thrust::sort(D.begin(), D.end());
    thrust::copy(D.begin(), D.end(), dummy.begin());
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_device);  // Stop clock
    std::cout << "\nDevice Time taken is: " << ts_device.tv_sec << " . " << ts_device.tv_nsec << " seconds" << std::endl;

    return 0;
}
You are not checking the return value of clock_settime. I would guess it is failing, probably with errno set to EPERM or EINVAL. Read the documentation and always check your return values! If I'm right, you are not resetting the clock as you think you are, hence the second timing is cumulative with the first, plus some extra stuff you don't intend to count at all.
The right way to do this is to call clock_gettime only: store the result first, do the computation, then subtract the original time from the end time.