How can I get at the read-ahead bytes?

Posted on 2024-08-30 20:18:42

Operating systems read more from disk than a program actually requests, because the program is likely to need nearby information in the future. In my application, when I fetch an item from disk, I would like to show an interval of information around that item. There's a trade-off between how much information I request and show, and speed. However, since the OS has already read more than I requested, accessing the bytes that are already in memory is essentially free. What API can I use to find out what's in the OS caches?

Alternatively, I could use memory-mapped files. In that case, the problem reduces to finding out whether a page is resident in memory or still on disk. Can this be done in any common OS?

EDIT: Related paper http://www.azulsystems.com/events/mspc_2008/2008_MSPC.pdf

Comments (4)

池予 2024-09-06 20:18:42

You can indeed use your second method, at least on Linux. mmap() the file, then use the mincore() function to determine which pages are resident. From the man page:

int mincore(void *addr, size_t length, unsigned char *vec);

mincore() returns a vector that indicates whether pages of the calling process's virtual memory are resident in core (RAM), and so will not cause a disk access (page fault) if referenced. The kernel returns residency information about the pages starting at the address addr, and continuing for length bytes.

There's of course a race condition here - mincore() can tell you that a page is resident, but the page might then be evicted just before you access it. C'est la vie.
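As a rough illustration of the mmap() plus mincore() approach described above, here is a minimal sketch for Linux. The helper name count_resident_pages is hypothetical and not part of any library; the sketch assumes glibc (hence _DEFAULT_SOURCE) and a non-empty regular file. Note that newer Linux kernels may limit what mincore() reports for files the caller neither owns nor can write to, so treat the result as a hint only.

```c
#define _DEFAULT_SOURCE 1   /* exposes mincore() on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: maps `path` read-only and counts how many of its
 * pages mincore() reports as resident. Stores the total number of pages
 * in *total_pages. Returns the resident count, or -1 on error. */
long count_resident_pages(const char *path, long *total_pages)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    size_t len = (size_t)st.st_size;
    size_t npages = (len + (size_t)page - 1) / (size_t)page;

    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc(npages);   /* one byte per page */
    if (vec == NULL) {
        munmap(map, len);
        return -1;
    }

    long resident = -1;
    if (mincore(map, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;  /* low bit set means "resident" */
    }

    free(vec);
    munmap(map, len);
    *total_pages = (long)npages;
    return resident;
}
```

A caller could check the pages around the item of interest (instead of the whole file) by passing a page-aligned sub-range to mincore(); the answer is only a snapshot and can be stale by the time the data is touched.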

机场等船 2024-09-06 20:18:42

You're starting out from a wrong presumption. At least on Linux, the OS will try to figure out the program's access patterns. If you read a file sequentially, the kernel will prefetch sequentially. If you jump around the file a lot, the kernel will probably be confused at first, but then it will stop prefetching.

So if you actually are accessing your file sequentially, you know what's probably prefetched: the next data block. If you are randomly seeking, probably nothing else in the vicinity is prefetched.

Try to approach this a different way. Before calling read() to get the information you need, call posix_fadvise() (for example with POSIX_FADV_WILLNEED) to let the OS know what you want it to start loading.
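A minimal sketch of that idea, assuming a POSIX system that provides posix_fadvise(): hint that the range around the item will be needed, then read only the item itself. The helper name read_with_hint and its center/radius parameters are hypothetical, chosen just for this illustration. The hint is advisory, so the sketch ignores its return value.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: tells the kernel that the byte range
 * [center - radius, center + radius) will be wanted soon, then reads
 * `len` bytes at offset `center`. Returns what pread() returns. */
ssize_t read_with_hint(int fd, off_t center, off_t radius,
                       void *buf, size_t len)
{
    assert(center >= 0 && radius >= 0);

    /* Clamp the start of the hinted range at the beginning of the file. */
    off_t start = center > radius ? center - radius : 0;
    off_t span = (center - start) + radius;

    /* Advisory only: posix_fadvise() returns an error number on failure,
     * but the read below works either way, so the result is ignored. */
    (void)posix_fadvise(fd, start, span, POSIX_FADV_WILLNEED);

    return pread(fd, buf, len, center);
}
```

After this call the surrounding range may already be loading in the background, so a later read of the neighbouring bytes has a better chance of being served from memory, though nothing guarantees it.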

I'm also curious to know what kind of application you're using that can run correctly by only operating on data that happens to be in the file cache by chance. I feel like we could find a good way to address your need if you posted a little more info.

泪是无色的血 2024-09-06 20:18:42

It certainly can't be done on Windows. On Windows the read-ahead behaviour is up to the OS, and even if it could tell you how much it had read ahead, it wouldn't do you any good, because as soon as you'd found out, the in-memory pages used for caching could have been reclaimed for some other use.

The same thing goes for determining whether a page is resident or not. As soon as you've found out, the answer might change when some other thread needs the memory for something else.

If you really wanted to do this kind of thing on Windows, you could turn off buffering and manage the buffers yourself. This is the fastest IO path, but it is also the most complex - you have to be very careful, and often the OS can still do it better.

溺深海 2024-09-06 20:18:42

What API can I use to find out what's in the OS caches?

There's certainly no standard way to do this for any POSIX system, and I'm not aware of any non-standard way specific to Linux. The only thing you can know (almost) for sure is that the file system will have read in a multiple of the page size, usually 4 kB. So, if your reads are small, you can know with high probability (although not for sure) that the data in the surrounding page is in memory.

You could, I suppose, do tricksy things like timing how long a read system call took to complete. If it's fast, that is hundreds of microseconds or less, it was probably a cache hit. Once it gets up to a millisecond or so, it was probably a cache miss. Of course, this doesn't actually help you very much, and it's very, very fragile.
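For illustration only, the timing trick could be sketched like this on a POSIX system. The helper names and the 500 microsecond threshold are assumptions made up for this sketch (picked to sit between the "hundreds of microseconds" and "a millisecond or so" figures above); real thresholds depend heavily on the hardware and the load.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Assumed cut-off between "probably a cache hit" and "probably a miss". */
#define HIT_THRESHOLD_NS 500000LL

/* Hypothetical helper: opens `path`, times a single pread() of `len`
 * bytes at offset `off`, and returns the elapsed time in nanoseconds,
 * or -1 on error. Only the pread() call itself is timed. */
long long timed_pread_ns(const char *path, void *buf, size_t len, off_t off)
{
    assert(buf != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    ssize_t n = pread(fd, buf, len, off);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (n < 0)
        return -1;
    return (long long)(t1.tv_sec - t0.tv_sec) * 1000000000LL
         + (long long)(t1.tv_nsec - t0.tv_nsec);
}

/* Guess based purely on latency; negative values mean "error". */
int looks_like_cache_hit(long long elapsed_ns)
{
    return elapsed_ns >= 0 && elapsed_ns < HIT_THRESHOLD_NS;
}
```

Keep in mind that the measurement changes what it measures: the timed read itself pulls the data into the cache, so this can only tell you after the fact whether the read was cheap.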

Please note that once the file system has copied the data to user buffers, it is free to immediately discard the buffers holding the data from disk. It probably doesn't do this right away, but you can't tell for sure.

Finally, I second @Karmastan's suggestion: explain the broader end you're trying to achieve. There's likely a way to do it, but the one you've suggested isn't it.
