clock_gettime() timing problem with CUDA
I wanted to write a CUDA program in which I could see firsthand the benefits that CUDA offers for speeding up applications.
Here is a CUDA program I have written using Thrust (http://code.google.com/p/thrust/).
Briefly, all that the code does is create two integer vectors of length 2^23, one on the host and one on the device, identical to each other, and sort them. It also (attempts to) measure the time for each.
On the host vector I use std::sort. On the device vector I use thrust::sort.
For compilation I used
nvcc sortcompare.cu -lrt
The output of the program at the terminal is
Desktop: ./a.out
Host Time taken is: 19 . 224622882 seconds
Device Time taken is: 19 . 321644143 seconds
Desktop:
The first std::cout statement is produced after 19.224 seconds, as stated. Yet the second std::cout statement (even though it says 19.32 seconds) is produced immediately after the first one. Note that I have used separate timespec structs for the measurements in clock_gettime(), viz. ts_host and ts_device.
I am using CUDA 4.0 and an NVIDIA GTX 570 (compute capability 2.0).
#include <iostream>
#include <vector>
#include <algorithm>
#include <stdlib.h>
// For timings
#include <time.h>
// Necessary Thrust headers
#include <thrust/sort.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

int main(int argc, char *argv[])
{
    int N = 23;
    thrust::host_vector<int> H(1 << N);      // Create a vector of 2^N elements on the host
    thrust::device_vector<int> D(1 << N);    // The same on the device
    thrust::host_vector<int> dummy(1 << N);  // Receives D copied back from the GPU after sorting

    // Set the host_vector elements.
    for (int i = 0; i < H.size(); ++i) {
        H[i] = rand();  // Set each host vector element to a pseudo-random number.
    }

    // Sort the host_vector. Measure time.
    // Reset the clock
    timespec ts_host;
    ts_host.tv_sec = 0;
    ts_host.tv_nsec = 0;
    clock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts_host);  // Start clock
    thrust::sort(H.begin(), H.end());
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_host);  // Stop clock
    std::cout << "\nHost Time taken is: " << ts_host.tv_sec << " . " << ts_host.tv_nsec << " seconds" << std::endl;

    D = H;  // Set the device vector elements equal to the host_vector

    // Sort the device vector. Measure time.
    timespec ts_device;
    ts_device.tv_sec = 0;
    ts_device.tv_nsec = 0;
    clock_settime(CLOCK_PROCESS_CPUTIME_ID, &ts_device);  // Start clock
    thrust::sort(D.begin(), D.end());
    thrust::copy(D.begin(), D.end(), dummy.begin());
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts_device);  // Stop clock
    std::cout << "\nDevice Time taken is: " << ts_device.tv_sec << " . " << ts_device.tv_nsec << " seconds" << std::endl;

    return 0;
}
You are not checking the return value of clock_settime. I would guess it is failing, probably with errno set to EPERM or EINVAL. Read the documentation and always check your return values! If I'm right, you are not resetting the clock as you think you are, hence the second timing is cumulative with the first, plus some extra stuff you don't intend to count at all.
The right way to do this is to call clock_gettime only: store the result first, do the computation, then subtract the original time from the end time.