Breaking a CUDA cudaMemcpy into chunks
A co-worker and I were brainstorming on how to mitigate the memory transfer time between host and device, and it came up that perhaps arranging things into one mega-transfer (i.e. one single call) might help. This led me to create a test case where I timed transferring a few large data chunks vs. many small data chunks. I got some very interesting/strange results, and was wondering if anyone here had an explanation?
I won't put my whole code up here since it's quite long, but I tested the chunking in two different ways:
Explicitly writing out all cudaMemcpy's, e.g.:
cudaEventRecord(start, 0);
cudaMemcpy(aD, a, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 1*nBytes/10, a + 1*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 2*nBytes/10, a + 2*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 3*nBytes/10, a + 3*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 4*nBytes/10, a + 4*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 5*nBytes/10, a + 5*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 6*nBytes/10, a + 6*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 7*nBytes/10, a + 7*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 8*nBytes/10, a + 8*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 9*nBytes/10, a + 9*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
Putting the cudaMemcpy's into a for loop:
cudaEventRecord(start, 0);
for(int i = 0; i < nChunks; i++)
{
cudaMemcpy(aD + i*nBytes/nChunks, a + i*nBytes/nChunks, nBytes/nChunks,
cudaMemcpyHostToDevice);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
To note, I also did a "warm-up" transfer at the start of each test just in case, though I don't think it was needed (the context was created by a cudaMalloc call).
I tested this on total transfer sizes ranging from 1 MB to 1 GB, where each test case transferred the same amount of information regardless of how it was chunked up. A sample of my output is this:
single large transfer = 0.451616 ms
10 explicit transfers = 0.198016 ms
100 explicit transfers = 0.691712 ms
10 looped transfers = 0.174848 ms
100 looped transfers = 0.683744 ms
1000 looped transfers = 6.145792 ms
10000 looped transfers = 104.981247 ms
100000 looped transfers = 13097.441406 ms
What's interesting here, and what I don't get, is that across the board the 10-chunk transfers were ALWAYS faster by a significant amount than any of the others, even the single large transfer! And that result stayed consistent no matter how large or small the data set was (i.e. 10x100MB vs. 1x1GB, or 10x1MB vs. 1x10MB, still left the 10-chunk case the fastest). If anyone has any insight into why this is, or what I may be doing wrong to get these weird numbers, I would be very interested to hear what you have to say.
Thanks!
P.S. I know that cudaMemcpy carries an implicit synchronization, so I could have used a CPU timer and the cudaEventSynchronize is redundant, but I figured it was better to be on the safe side.
UPDATE: I wrote a function to try and take advantage of this apparent rip in the performance space-time continuum. When I use that function, though, which is written EXACTLY as in my test cases, the effect goes away and I see what I expect (a single cudaMemcpy is fastest). Perhaps this is all more akin to quantum physics than relativity, wherein the act of observing changes the behavior...
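(The wrapper isn't shown in the post; a hypothetical sketch of what such a chunked-copy timing helper might look like is below. The name timedChunkedCopy, the char* pointer types, and the assumption that nBytes divides evenly by nChunks are mine, not from the post.)
// Hypothetical sketch of a chunked host-to-device copy helper timed with CUDA events.
// Assumes aD is device memory, a is host memory, and nBytes % nChunks == 0.
float timedChunkedCopy(char* aD, const char* a, size_t nBytes, int nChunks)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Issue the chunked copies and bracket them with events
    cudaEventRecord(start, 0);
    size_t chunk = nBytes / nChunks;
    for (int i = 0; i < nChunks; i++)
        cudaMemcpy(aD + i * chunk, a + i * chunk, chunk, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}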
3 Answers
cudaMemcpy() is synchronous - CUDA waits until the memcpy is done before returning to your app.
If you call cudaMemcpyAsync(), the driver will return control to your app before the GPU necessarily has performed the memcpy.
It's critical that you call cudaMemcpyAsync() instead of cudaMemcpy(). Not because you want to overlap the transfers with GPU processing, but because that is the only way you will get CPU/GPU concurrency.
On a cg1.4xlarge instance in Amazon EC2, it takes ~4 microseconds for the driver to request a memcpy of the GPU; so CPU/GPU concurrency is a good way to hide driver overhead.
I don't have a ready explanation for the disparity you are seeing at 10 - the main knee I'd expect to see is where the memcpy crosses over 64K in size. The driver inlines memcpy's smaller than 64K into the same buffer used to submit commands.
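To illustrate the suggestion (this is not code from the answer): a minimal sketch of issuing the chunked copies with cudaMemcpyAsync() on a stream. Note that the host buffer must be pinned (cudaMallocHost) for the copies to be truly asynchronous; the buffer names and sizes below are assumptions.
// Sketch: chunked copies via cudaMemcpyAsync on a single stream, timed with events.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t nBytes  = 100 << 20;          // 100 MB total, illustrative
    const int    nChunks = 10;
    const size_t chunk   = nBytes / nChunks;

    char *h = 0, *d = 0;
    cudaMallocHost((void**)&h, nBytes);        // pinned host memory
    cudaMalloc((void**)&d, nBytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    for (int i = 0; i < nChunks; i++)
        cudaMemcpyAsync(d + i * chunk, h + i * chunk, chunk,
                        cudaMemcpyHostToDevice, stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);                // wait for all queued copies to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d async chunks: %f ms\n", nChunks, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}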
Use cudaThreadSynchronize() before and after each CUDA call to get the real memory transfer time; cudaMemcpy() is synchronous within CUDA but not necessarily with the CPU execution, depending on the function called.
CUDA function calls are synchronous with other CUDA function calls, such as other memory transfers or kernel executions; this is managed in a separate CUDA thread invisible to the CUDA developer. cudaMemcpyAsync() is asynchronous with respect to other CUDA calls, which is why it requires that the copied GPU memory segments do not overlap with other concurrent memory transfers.
Are you sure that in this case cudaMemcpy(), which is synchronous in the CUDA execution thread, is also synchronous with the CPU thread? Depending on the CUDA function it may or may not be, but if you use cudaThreadSynchronize() when measuring times it will definitely be synchronous with the CPU, and the real time of each step will show up.
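A minimal sketch of how that could fit into the question's timing code (start, stop, time, aD, a, and nBytes as in the snippets above; cudaThreadSynchronize() was the name at the time and has since been deprecated in favour of cudaDeviceSynchronize()):
cudaThreadSynchronize();                    // drain any previously queued GPU work
cudaEventRecord(start, 0);
cudaMemcpy(aD, a, nBytes, cudaMemcpyHostToDevice);
cudaThreadSynchronize();                    // make sure the copy has really completed
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);   // elapsed time in milliseconds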
Perhaps it is some peculiarity in how CUDA measures time. You are measuring times of less than 1 ms, which is very small.
Did you try timing it with a CPU-based timer and comparing the results?
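For what it's worth, a sketch of such a CPU-side cross-check (not from the answer; assumes aD, a, and nBytes as in the question, a C++11 compiler, and that <chrono> and <cstdio> are included):
auto t0 = std::chrono::steady_clock::now();
cudaMemcpy(aD, a, nBytes, cudaMemcpyHostToDevice);   // blocking for pageable host memory
cudaDeviceSynchronize();                             // extra safety before reading the clock
auto t1 = std::chrono::steady_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
printf("single large transfer (CPU timer) = %f ms\n", ms);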