Doing all the I/O at once runs slower than reading a little at a time
I am working on optimizing an algorithm that we are preparing to put on a GPU using CUDA.
The I/O part reads from 3 different images, one row at a time, right in the middle of the loop that runs the filter over the images. I decided to try pre-loading the values by moving the I/O into its own loop, dumping the values into arrays that hold the images and that are used in the calculation.
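Roughly, the change looks like this (a simplified sketch, not the actual code; all names, sizes, and the row handling are illustrative):

    #include <vector>

    // Hypothetical stand-ins for the real I/O and filter routines.
    static std::vector<float> read_row(int image, int row) {
        return std::vector<float>(512, 0.0f);  // pretend one row of pixels
    }
    static void filter_row(const std::vector<float>& a,
                           const std::vector<float>& b,
                           const std::vector<float>& c) { /* filter work */ }

    static const int kRows = 1024;  // illustrative image height

    // Before: reads sit in the middle of the filter loop.
    static void interleaved() {
        for (int r = 0; r < kRows; ++r)
            filter_row(read_row(0, r), read_row(1, r), read_row(2, r));
    }

    // After: all rows are preloaded into arrays, then the filter runs alone.
    static void preloaded() {
        std::vector<std::vector<float>> img[3];
        for (int r = 0; r < kRows; ++r)
            for (int i = 0; i < 3; ++i)
                img[i].push_back(read_row(i, r));
        for (int r = 0; r < kRows; ++r)
            filter_row(img[0][r], img[1][r], img[2][r]);
    }

    int main() { interleaved(); preloaded(); return 0; }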
Now, the problem is, my application seems to run slower with the buffers fully loaded with data, and faster when it has to go out to disk for new data every iteration.
What could be causing this? Would cache misses from the larger buffers really kill performance that much? It's not a memory issue: with 24 GB on this machine there is plenty of RAM.
Not sure what else it could be; I'm open to ideas.
3 Answers
@Derek provided the following additional information:
That is a huge difference in run time. Since OpenMP is used, we can assume there are multiple threads. Since you're only dealing with 72 MB of data, I can't see how the difference in I/O time could be that large. We can be confident the read time is smaller than your original 10-14 seconds, so unless there is a bug in that portion of the code, the extra time is in the filter section. (The images are presumably binary?) As @Satya has suggested, profiling your code, or at least adding some timing printouts, may help identify where the problem lies.
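For example, a minimal timing sketch using OpenMP's wall-clock timer (the stage functions are hypothetical stand-ins for the real I/O and filter code):

    #include <cstdio>
    #include <omp.h>

    // Hypothetical stand-ins for the real stages of the application.
    static void load_images() { /* read the three images here */ }
    static void run_filter()  { /* run the filter here */ }

    int main() {
        double t0 = omp_get_wtime();
        load_images();                         // I/O stage
        double t1 = omp_get_wtime();
        run_filter();                          // compute stage
        double t2 = omp_get_wtime();

        std::printf("I/O:    %.3f s\n", t1 - t0);
        std::printf("filter: %.3f s\n", t2 - t1);
        return 0;
    }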
The "advantage" of reading in the loop may be:
Given your latest update, it does seem more likely we're dealing with #2. Something to watch out for, though, is the memory access pattern (across all threads): you may be seeing cache thrashing because data that used to be adjacent in main memory is now further apart. This can have a large impact, because if you have many memory accesses and they are all cache misses, you always incur the cost of fetching the data from further out, which can be an order of magnitude slower.
A solution to this is to arrange your memory in stripes, e.g. n lines from the first image, followed by n lines from the second image, followed by n lines from the third image. IIRC this technique is called "striping". The exact stripe size depends on your CPU but it's something you can experiment with (or start with the same amount of data that used to be read in the inner loop if that's large enough).
E.g.:
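A minimal sketch of such a striped layout, assuming three equally sized float images (the struct name and the rounding of the height up to a whole stripe are illustrative):

    #include <cstddef>
    #include <vector>

    struct StripedImages {
        std::size_t width;        // pixels per row
        std::size_t stripe;       // rows per stripe; tune for your CPU cache
        std::vector<float> data;  // [img0 rows][img1 rows][img2 rows] repeated

        StripedImages(std::size_t w, std::size_t h, std::size_t n)
            : width(w), stripe(n),
              data(3 * w * ((h + n - 1) / n) * n) {}  // round h up to a stripe

        // Pointer to row `row` of image `img` (img is 0, 1 or 2).
        float* row_ptr(std::size_t img, std::size_t row) {
            std::size_t group  = row / stripe;   // which stripe group
            std::size_t offset = row % stripe;   // row inside the stripe
            return data.data()
                 + group * 3 * stripe * width    // skip earlier groups
                 + img * stripe * width          // skip earlier images in group
                 + offset * width;               // skip earlier rows
        }
    };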
Read one file at a time so you're not seeking back and forth on your drive.
Regardless, to maximize performance you probably want to look into using asynchronous/overlapped I/O, so that the next bit of image data is coming in while you are processing the previous bit.
If you're developing under Windows this can give you a start on doing overlapped I/O:
http://msdn.microsoft.com/en-us/library/ms686358%28v=vs.85%29.aspx
Once you are doing your I/O in parallel you can figure out if your bottleneck is in the I/O or in the processing. There are different techniques for optimizing those.
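For instance, a portable sketch of that double-buffering idea using std::async rather than the Win32 overlapped API (the chunked read/process split and all names are illustrative):

    #include <future>
    #include <vector>

    // Hypothetical stand-ins for the real I/O and filter work.
    static std::vector<char> read_chunk(int index) {
        return std::vector<char>(1 << 20);        // pretend to read 1 MB
    }
    static void process_chunk(const std::vector<char>& chunk) { /* filter */ }

    int main() {
        const int num_chunks = 16;                // illustrative
        // Start reading chunk 0 before the loop.
        std::future<std::vector<char>> next =
            std::async(std::launch::async, read_chunk, 0);

        for (int i = 0; i < num_chunks; ++i) {
            std::vector<char> current = next.get();   // wait for chunk i
            if (i + 1 < num_chunks)                   // kick off chunk i+1 ...
                next = std::async(std::launch::async, read_chunk, i + 1);
            process_chunk(current);                   // ... while we filter i
        }
        return 0;
    }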
Yes, you load your image into the L2 cache twice: once when you load it from the file, and again when you read it back from memory. You also spend some time moving data from the cache back out to memory.
As an option, you could try loading the data in parts of about 2-8 MB (depending on your L2 cache size).
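For example, a minimal sketch of reading in parts (the 4 MB part size and file name are illustrative):

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        // Read and filter the file a few megabytes at a time so each part
        // can stay cache-resident.
        const std::size_t part_bytes = 4 * 1024 * 1024;
        std::vector<char> part(part_bytes);

        std::FILE* f = std::fopen("image0.raw", "rb");  // hypothetical file
        if (!f) return 1;

        std::size_t n;
        while ((n = std::fread(part.data(), 1, part_bytes, f)) > 0) {
            // process_part(part.data(), n);  // filter this part while hot
        }
        std::fclose(f);
        return 0;
    }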
In addition to @Guy's answer, I should mention memory-mapped files; they combine the best parts of both approaches. However, it should take about a second to read 70 MB, so the problem lies somewhere else.
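For reference, a minimal POSIX sketch of memory-mapping an image file (the file name is illustrative, and error handling is reduced to early returns):

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        // Map one image file into memory; pages are faulted in lazily as
        // the filter touches them, so reading and processing overlap
        // without an explicit buffer.
        int fd = open("image0.raw", O_RDONLY);
        if (fd < 0) return 1;

        struct stat st;
        if (fstat(fd, &st) != 0) { close(fd); return 1; }

        void* map = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (map == MAP_FAILED) { close(fd); return 1; }

        const unsigned char* data = static_cast<const unsigned char*>(map);
        // ... run the filter over data[0 .. st.st_size) ...
        (void)data;

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }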
It could be caused by the coherence of the core caches. I don't know much about this, but if two threads have write access to the same memory page at the same time (or worse, to the same cache line), then their caches have to be synchronized. When you read the whole image at once, all your processing threads work on it at the same time. Do they write their results to nearby memory addresses? When you read the images line by line, the threads spend some of their time waiting for I/O to complete, so this happens less often.
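A small sketch of the usual padding fix for that effect, assuming 64-byte cache lines (the struct name and thread-count cap are illustrative):

    #include <omp.h>

    // If each thread's accumulator sat packed next to the others, several
    // of them would share one cache line and concurrent writes would force
    // the cores to keep re-synchronizing it. Padding each slot to a full
    // cache line gives every thread its own line.
    struct alignas(64) PaddedResult {
        double value;
    };

    int main() {
        PaddedResult results[64] = {};   // assumes at most 64 threads

        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            for (int i = 0; i < 1000000; ++i)
                results[tid].value += 1.0;  // each thread writes its own line
        }
        return 0;
    }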