Why does multithreading this code produce such inconsistent timings?
I have a function that processes a large image. By the specification, the largest this image can be is 55 MB. The processing entails breaking the image into several different bands and then reconstituting the image by adding those bands back into an output image. Because the image is so large, I can't keep all four band images plus the input and output images in memory simultaneously on a 32-bit system. As a result, I put each image on disk and then read it back in pieces.
Prior to multithreading, the pseudocode looks like:
for (int y = 0; y < ysize; ++y) {
    // one scan line of each band image, read back from its temp file
    unsigned short* ptr1 = ReadLineFromDisk(image1, y);
    unsigned short* ptr2 = ReadLineFromDisk(image2, y);
    unsigned short* ptr3 = ReadLineFromDisk(image3, y);
    unsigned short* ptr4 = ReadLineFromDisk(image4, y);
    unsigned short* outPtr = &outImage[y * inXSize];
    for (int x = 0; x < xsize; ++x, ++ptr1, ++ptr2, ++ptr3, ++ptr4, ++outPtr) {
        *outPtr = combination of *ptr1, *ptr2, *ptr3, *ptr4;
    }
}
This code runs in 3 seconds on a dual-core machine with a standard 500 GB hard drive, timed with a high-performance counter.
If I increase the number of lines read from disk to something like 100, and then step through that, with code that looks like:
chunksize = 100;
for (int y = 0; y < ysize; y += chunksize) {
    // one ~100-line chunk of each band image per disk read
    unsigned short* ptr1 = ReadChunkFromDisk(image1, y);
    unsigned short* ptr2 = ReadChunkFromDisk(image2, y);
    unsigned short* ptr3 = ReadChunkFromDisk(image3, y);
    unsigned short* ptr4 = ReadChunkFromDisk(image4, y);
    unsigned short* outPtr = &outImage[y * inXSize];
    for (int x = 0; x < xsize * chunksize; ++x, ++ptr1, ++ptr2, ++ptr3, ++ptr4, ++outPtr) {
        *outPtr = combination of *ptr1, *ptr2, *ptr3, *ptr4;
    }
}
This code is faster than the previous code, down to 1.5 seconds.
Question 1: Why is that code faster?
I hypothesize that it's faster because, in my experience, large, contiguous reads are faster than smaller ones for the same amount of data. That is, if I read 100 lines of data all at once, that's faster than 100 individual reads, at least for a regular (non-SSD) hard drive. Is my hypothesis close to correct?
Even so, the processor is not being used intensively here. And increasing the chunk size further doesn't pay off: 1.5 seconds is the best I can get, after which the time appears to creep back up a bit (not sure why that would be either, except that maybe some disk caching is playing a role). That leads me to
Question 2: Why would there be a sweet spot in the chunk size?
If I understand things here (and I don't think I really do), then if everything could be held in memory, this would be extremely quick, because there would be no disk hits. And if reading more at once makes things faster, shouldn't reading in, say, a quarter of the image at a time cost only a slight speed hit, rather than being slower than the 100-line chunks?
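One way I can probe both questions is to time the same total read at different chunk sizes. Here is roughly the harness I'd use (a sketch of my own, not the real code: the path argument, xsize, and ysize stand in for one of the actual band files and its dimensions):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Time reading an entire band file in chunks of linesPerChunk scan lines.
double TimeChunkedRead(const char* path, int xsize, int ysize, int linesPerChunk) {
    std::vector<unsigned short> buf(static_cast<size_t>(xsize) * linesPerChunk);
    FILE* f = std::fopen(path, "rb");
    if (!f) return -1.0;
    auto t0 = std::chrono::steady_clock::now();
    for (int y = 0; y < ysize; y += linesPerChunk) {
        int lines = std::min(linesPerChunk, ysize - y);  // last chunk may be short
        std::fread(buf.data(), sizeof(unsigned short),
                   static_cast<size_t>(xsize) * lines, f);
    }
    auto t1 = std::chrono::steady_clock::now();
    std::fclose(f);
    return std::chrono::duration<double>(t1 - t0).count();
}

Sweeping linesPerChunk from 1 up to ysize / 4 should show whether the curve really turns back up past about 100 lines, and reading a different file between runs would rule out the OS file cache flattening the numbers.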
So then I switch to placing the outer loop in a lambda expression and using Intel's TBB to thread the code, something like:
chunksize = 100;
tbb::parallel_for(0, ysize, chunksize, [&](int y) {
    unsigned short* ptr1 = ReadChunkFromDisk(image1, y);
    unsigned short* ptr2 = ReadChunkFromDisk(image2, y);
    unsigned short* ptr3 = ReadChunkFromDisk(image3, y);
    unsigned short* ptr4 = ReadChunkFromDisk(image4, y);
    unsigned short* outPtr = &outImage[y * inXSize];
    for (int x = 0; x < xsize * chunksize; ++x, ++ptr1, ++ptr2, ++ptr3, ++ptr4, ++outPtr) {
        *outPtr = combination of *ptr1, *ptr2, *ptr3, *ptr4;
    }
});
This code's run time ranges from 0.4 seconds to 1.6 seconds.
That brings me to:
Question 3: Shouldn't that speed increase be, at most, 2x, not 4x?
This is a dual-core machine I'm running these benchmarks on, so in a perfect world one thread reads from disk while the other processes; the total time drops from read time plus compute time to roughly the larger of the two, which is at most a 2x improvement. Even when it runs at the 4x speed increase, it's only using 80% of the processors, not 100%, so there is still a disk bottleneck. But a 4x increase means that something else is happening as well.
I also assume that the wide range of timings comes from the threads not being synchronized on their reads, if that's how the speed increase is happening. The real, final question is:
Question 4: How can I get that 4x speed increase consistently?
Answer 1: Yes, you're disk-bound, so the CPU will not be pegged all that much, and yes, reading larger chunks is more efficient (as long as the chunks are aligned with the disk cache).
Answer 2: A disk that has an 8 MB cache and is spinning at 10k RPM might get a throughput of 60 to 80 MB/sec, so the "sweet spot" would be to read chunks aligned with the cache size. You can increase your buffer, but keep it aligned with the cache size: i.e. 8MB, 16MB, 32MB, etc.
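For instance, sizing the chunk so each read spans a whole multiple of the cache might look like this (a sketch; the 8 MB figure is just the assumed cache size from above, and xsize is your image width):

// Choose lines-per-chunk so one read is a whole multiple of the
// assumed 8 MB disk cache.
const size_t kDiskCache   = 8u * 1024 * 1024;                  // assumed cache size
const size_t bytesPerLine = static_cast<size_t>(xsize) * sizeof(unsigned short);
size_t linesPerChunk      = kDiskCache / bytesPerLine;         // roughly 8 MB per read
if (linesPerChunk == 0) linesPerChunk = 1;                     // guard for very wide images

Doubling kDiskCache to 16 MB or 32 MB and re-timing keeps the reads aligned while testing larger buffers.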
Answer 3: Ideally you would want to dedicate one thread to reading from disk and another to processing the data (you may want to use several threads for the processing); see the sketch under Answer 4 for one way to arrange that. Multithreading the disk reads themselves may yield some small performance increase, but generally it does not. I don't know why you think "something else" is happening when you get a 4x increase.
Answer 3 Update:
Frankly, I don't exactly know why this is happening either, but I've also seen it with multithreaded disk I/O in .NET applications. As a matter of fact, I even have a C# test example which demonstrates the same kind of performance increase that you're noticing. Note that in my test I'm loading HTML pages that are roughly what you would see in the wild (about 80 to 160 KB each), so I'm not aligning my reads with the disk cache. It may be that multiple threads reading at once are actually more efficient because, despite doing multiple reads, you're taking better advantage of the disk cache. Of course, this is just an off-the-cuff assumption that I have no evidence to back up yet, so please take it with a grain of salt! I think that if your files are large enough, and your disk-reading thread has a buffer aligned with the disk cache, then adding more threads won't improve your speed at all. If you still see an improvement in speed then do let us know!
Answer 4:
Try the following: dedicate a single reader thread that pulls chunks off the disk and hands them to processing threads through a queue, as described in Answer 3; a rough sketch follows below.
And again, you're disk-bound, so you may never really get 100% CPU utilization.
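A minimal sketch of that reader/worker split (my own illustration, not tested against your setup; ReadChunkOfAllBands and CombineChunk are hypothetical stand-ins for your I/O and your band arithmetic):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Chunk {
    int y;                              // first scan line of this chunk
    std::vector<unsigned short> bands;  // the four band chunks, packed together
};

// Stand-ins for your real routines.
Chunk ReadChunkOfAllBands(int y, int chunksize);  // does all four disk reads
void  CombineChunk(const Chunk& c);               // pure CPU: combine into outImage

std::queue<Chunk> work;
std::mutex m;
std::condition_variable cv;
bool done = false;

void Reader(int ysize, int chunksize) {
    for (int y = 0; y < ysize; y += chunksize) {
        Chunk c = ReadChunkOfAllBands(y, chunksize);  // only this thread touches the disk
        {
            std::lock_guard<std::mutex> lk(m);
            work.push(std::move(c));
        }
        cv.notify_one();
    }
    { std::lock_guard<std::mutex> lk(m); done = true; }
    cv.notify_all();
}

void Worker() {
    for (;;) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return !work.empty() || done; });
        if (work.empty()) return;        // reader finished and queue drained
        Chunk c = std::move(work.front());
        work.pop();
        lk.unlock();
        CombineChunk(c);                 // overlaps with the next disk read
    }
}

// Usage: std::thread r(Reader, ysize, chunksize); std::thread w(Worker); r.join(); w.join();

With a single sequential reader the disk sees one long stream of large requests instead of competing seeks, and a bounded queue (e.g. tbb::concurrent_bounded_queue with a small capacity) would keep the reader from buffering the whole 55 MB image and blowing past your 32-bit memory budget.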
Answer 4 Update:
I don't think that Intel's TBB is actually the thing causing the performance increase you (and I) are seeing... as I said, my best guess is that multiple threads may actually be more efficient if they provide better utilization of the disk cache. I'm not even sure that's a correct assumption, so don't quote me without testing!
Reading:
I found a very detailed dissertation, titled Asynchronous/Multi Threaded I/O on Commodity Systems with Multiple Disks – a performance study, that does some amazing analysis and testing of cases where multithreaded I/O outperforms single-threaded I/O. Take a look around page 86.
Dr. Dobb's also has an article on the subject; I didn't have a chance to read the whole thing, but I skimmed through it.