Doing all the I/O at once runs slower than reading a little at a time
I am working on optimizing an algorithm that we are preparing to put on a GPU using CUDA.
The I/O part reads from 3 different images, one row at a time, right in the middle of the loop that runs the filter over the images. I decided to try pre-loading the values by moving the I/O into its own loop, dumping the values into arrays that hold the images and that are used in the calculation.
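Roughly, the change looks like this (a simplified sketch, not the actual code; all names, sizes, and the row handling are illustrative):

    #include <vector>

    // Hypothetical stand-ins for the real I/O and filter routines.
    static std::vector<float> read_row(int image, int row) {
        return std::vector<float>(512, 0.0f);  // pretend one row of pixels
    }
    static void filter_row(const std::vector<float>& a,
                           const std::vector<float>& b,
                           const std::vector<float>& c) { /* filter work */ }

    static const int kRows = 1024;  // illustrative image height

    // Before: reads sit in the middle of the filter loop.
    static void interleaved() {
        for (int r = 0; r < kRows; ++r)
            filter_row(read_row(0, r), read_row(1, r), read_row(2, r));
    }

    // After: all rows are preloaded into arrays, then the filter runs alone.
    static void preloaded() {
        std::vector<std::vector<float>> img[3];
        for (int r = 0; r < kRows; ++r)
            for (int i = 0; i < 3; ++i)
                img[i].push_back(read_row(i, r));
        for (int r = 0; r < kRows; ++r)
            filter_row(img[0][r], img[1][r], img[2][r]);
    }

    int main() { interleaved(); preloaded(); return 0; }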
Now, the problem is, my application seems to run slower with the buffers fully loaded with data, and faster when it has to go out to disk for new data every iteration.
What could be causing this? Would cache misses from the larger buffers really kill performance that much? It's not a memory issue: with 24 GB on this machine there is plenty of RAM.
Not sure what else it could be; I'm open to ideas.
3 Answers
@Derek provided the following additional information:
That is a huge difference in run time. Since OpenMP is used, we can assume there are multiple threads. Since you're only dealing with 72 MB of data, I can't see how the difference in I/O time could be that large. We can be confident the read time is smaller than your original 10-14 seconds, so unless there is a bug in that portion of the code, the extra time is in the filter section. (The images are presumably binary?) As @Satya has suggested, profiling your code, or at least adding some timing printouts, may help identify where the problem lies.
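For example, a minimal timing sketch using OpenMP's wall-clock timer (the stage functions are hypothetical stand-ins for the real I/O and filter code):

    #include <cstdio>
    #include <omp.h>

    // Hypothetical stand-ins for the real stages of the application.
    static void load_images() { /* read the three images here */ }
    static void run_filter()  { /* run the filter here */ }

    int main() {
        double t0 = omp_get_wtime();
        load_images();                         // I/O stage
        double t1 = omp_get_wtime();
        run_filter();                          // compute stage
        double t2 = omp_get_wtime();

        std::printf("I/O:    %.3f s\n", t1 - t0);
        std::printf("filter: %.3f s\n", t2 - t1);
        return 0;
    }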
The "advantage" of reading in the loop may be:
Given your latest update, it does seem more likely we're dealing with #2. Something to watch out for, though, is the memory access pattern (across all threads): you may be seeing cache thrashing because data that used to be adjacent in main memory is now further apart. This can have a large impact, because if you have many memory accesses and they are all cache misses, you always incur the cost of fetching the data from further out, which can be an order of magnitude slower.
A solution to this is to arrange your memory in stripes, e.g. n lines from the first image, followed by n lines from the second image, followed by n lines from the third image. IIRC this technique is called "striping". The exact stripe size depends on your CPU but it's something you can experiment with (or start with the same amount of data that used to be read in the inner loop if that's large enough).
E.g.:
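A minimal sketch of such a striped layout, assuming three equally sized float images (the struct name and the rounding of the height up to a whole stripe are illustrative):

    #include <cstddef>
    #include <vector>

    struct StripedImages {
        std::size_t width;        // pixels per row
        std::size_t stripe;       // rows per stripe; tune for your CPU cache
        std::vector<float> data;  // [img0 rows][img1 rows][img2 rows] repeated

        StripedImages(std::size_t w, std::size_t h, std::size_t n)
            : width(w), stripe(n),
              data(3 * w * ((h + n - 1) / n) * n) {}  // round h up to a stripe

        // Pointer to row `row` of image `img` (img is 0, 1 or 2).
        float* row_ptr(std::size_t img, std::size_t row) {
            std::size_t group  = row / stripe;   // which stripe group
            std::size_t offset = row % stripe;   // row inside the stripe
            return data.data()
                 + group * 3 * stripe * width    // skip earlier groups
                 + img * stripe * width          // skip earlier images in group
                 + offset * width;               // skip earlier rows
        }
    };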
Read one file at a time so you're not seeking back and forth on your drive.
Regardless, to maximize performance you probably want to look into using asynchronous/overlapped I/O, so that the next bit of image data is coming in while you are processing the previous bit.
If you're developing under Windows this can give you a start on doing overlapped I/O:
http://msdn.microsoft.com/en-us/library/ms686358%28v=vs.85%29.aspx
Once you are doing your I/O in parallel you can figure out if your bottleneck is in the I/O or in the processing. There are different techniques for optimizing those.
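For instance, a portable sketch of that double-buffering idea using std::async rather than the Win32 overlapped API (the chunked read/process split and all names are illustrative):

    #include <future>
    #include <vector>

    // Hypothetical stand-ins for the real I/O and filter work.
    static std::vector<char> read_chunk(int index) {
        return std::vector<char>(1 << 20);        // pretend to read 1 MB
    }
    static void process_chunk(const std::vector<char>& chunk) { /* filter */ }

    int main() {
        const int num_chunks = 16;                // illustrative
        // Start reading chunk 0 before the loop.
        std::future<std::vector<char>> next =
            std::async(std::launch::async, read_chunk, 0);

        for (int i = 0; i < num_chunks; ++i) {
            std::vector<char> current = next.get();   // wait for chunk i
            if (i + 1 < num_chunks)                   // kick off chunk i+1 ...
                next = std::async(std::launch::async, read_chunk, i + 1);
            process_chunk(current);                   // ... while we filter i
        }
        return 0;
    }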
Yes, you load your image into the L2 cache twice: once when you load it from the file, and again when you read it back from memory. You also spend some time moving data from the cache back out to memory.
As an option, you could try loading the data in parts of about 2-8 MB (depending on your L2 cache size).
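For example, a minimal sketch of reading in parts (the 4 MB part size and file name are illustrative):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        // Read and filter the file a few megabytes at a time so each part
        // can stay cache-resident.
        const std::size_t part_bytes = 4 * 1024 * 1024;
        std::vector<char> part(part_bytes);

        std::FILE* f = std::fopen("image0.raw", "rb");  // hypothetical file
        if (!f) return 1;

        std::size_t n;
        while ((n = std::fread(part.data(), 1, part_bytes, f)) > 0) {
            // process_part(part.data(), n);  // filter this part while hot
        }
        std::fclose(f);
        return 0;
    }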
In addition to @Guy's answer, I should mention memory-mapped files; they combine the best parts of both approaches. However, it should take about a second to read 70 MB, so the problem lies somewhere else.
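For reference, a minimal POSIX sketch of memory-mapping an image file (the file name is illustrative, and error handling is reduced to early returns):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        // Map one image file into memory; pages are faulted in lazily as
        // the filter touches them, so reading and processing overlap
        // without an explicit buffer.
        int fd = open("image0.raw", O_RDONLY);
        if (fd < 0) return 1;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return 1; }

        void* map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { close(fd); return 1; }

        const unsigned char* data = static_cast<const unsigned char*>(map);
        // ... run the filter over data[0 .. st.st_size) ...
        (void)data;

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }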
It could be caused by the coherence of the core caches. I don't know much about this, but if two threads have write access to the same memory page at the same time (or worse, to the same cache line), then their caches have to be synchronized. When you read the whole image at once, all your processing threads work on it at the same time. Do they write their results to nearby memory addresses? When you read the images line by line, the threads spend some of their time waiting for I/O to complete, so this happens less often.
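A small sketch of the usual padding fix for that effect, assuming 64-byte cache lines (the struct name and thread-count cap are illustrative):

    #include <omp.h>

    // If each thread's accumulator sat packed next to the others, several
    // of them would share one cache line and concurrent writes would force
    // the cores to keep re-synchronizing it. Padding each slot to a full
    // cache line gives every thread its own line.
    struct alignas(64) PaddedResult {
        double value;
    };

    int main() {
        PaddedResult results[64] = {};   // assumes at most 64 threads

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            for (int i = 0; i < 1000000; ++i)
                results[tid].value += 1.0;  // each thread writes its own line
        }
        return 0;
    }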