Library or tool to manage shared mmapped files

Posted on 2024-12-14 23:47:05

Disclaimer: This is probably a research question as I cannot find what I am looking for, and it is rather specific.

Problem: I have a custom search application that needs to read between 100K and 10M files, each between 0.01 MB and about 10 MB. Each file contains one array that could be loaded directly as an array via mmap. I am looking for a solution that prefetches files into RAM before they are needed and, if system memory is full, ejects the ones that were already processed.

I know this sounds a lot like a combination of OS memory management and something like memcached. What I am actually looking for is something like memcached that doesn't return strings or values for a key, but rather the address for the start of a chosen array. In addition, (this is a different topic) I would like to be able to have the shared memory managed such that the distance between the CPU core and the RAM is the shortest on NUMA machines.

My question is: "does a tool/library like this already exist?"

Comments (3)

锦上情书 2024-12-21 23:47:05

Your question is related to this one

I'm not sure you need to find a library. You just need to understand how to efficiently use system calls.

I believe the readahead system call could help you.
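
A minimal sketch of the prefetch idea, under stated assumptions. Python's standard library does not expose readahead directly, so this swaps in `os.posix_fadvise` with `POSIX_FADV_WILLNEED`, a related advisory hint that asks the kernel to start loading a file into the page cache. The helper name `prefetch` is hypothetical, and the kernel is free to ignore the hint.

```python
import os

def prefetch(path):
    """Hint the kernel to start reading `path` into the page cache.

    Returns the file size in bytes. The hint is advisory only.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # Assumption: a platform that provides posix_fadvise (e.g. Linux).
        # Where it is missing, this sketch silently does nothing.
        if hasattr(os, "posix_fadvise"):
            # offset=0 and len=0 cover the file from start to end.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        return size
    finally:
        os.close(fd)
```

Calling `prefetch(path)` shortly before a file is needed gives the kernel a chance to do the I/O in the background, so a later mmap of the same file is more likely to find its pages already cached.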

万人眼中万个我 2024-12-21 23:47:05

Indeed, you have many, many files (perhaps too many). I hope that your filesystem is good enough, or that the files are spread over many directories. Having millions of files may become a concern if the filesystem is not tuned appropriately (but I won't dare help with this).

I don't know whether it is your application that writes and reads that many files. You might consider switching to a fast DBMS like PostgreSQL or MySQL, or perhaps you could use GDBM (http://www.gnu.org.ua/software/gdbm/).

深海夜未眠 2024-12-21 23:47:05

I once did this for a search-engine kind of application. It used an LRU chain, which was also addressable (via a hash table) by file ID and, IIRC, by memory address. On every access, the hot item was moved to the head of the LRU chain. When memory got tight (mmap can fail ...), the tail of the LRU chain was unmapped.
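
The scheme described above could be sketched roughly as follows. This is an illustration and not the answerer's actual code: the class name `MmapLRU` and the byte budget `max_bytes` are assumptions (the original evicted when memory got tight or mmap failed), and an `OrderedDict` stands in for the hash table plus LRU chain, with the most recently used entry kept at the end instead of the head.

```python
import mmap
from collections import OrderedDict

class MmapLRU:
    """Keep read-only file mappings in LRU order, bounded by mapped bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes   # hypothetical budget for mapped bytes
        self.used = 0
        self.maps = OrderedDict()    # path -> mmap; last entry is the hottest

    def get(self, path):
        m = self.maps.get(path)
        if m is not None:
            self.maps.move_to_end(path)  # reposition the hot item
            return m
        with open(path, "rb") as f:
            # Length 0 maps the whole file; the mapping stays valid after
            # the file object is closed.
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        self.maps[path] = m
        self.used += len(m)
        self._evict()
        return m

    def _evict(self):
        # Unmap from the cold end until the budget fits, but never drop
        # the entry that was just added (it sits at the hot end).
        while self.used > self.max_bytes and len(self.maps) > 1:
            _, cold = self.maps.popitem(last=False)
            self.used -= len(cold)
            cold.close()
```

Note that callers must not keep using a mapping after it has been evicted, which is one reason this kind of layer needs care.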

The pitfall of this scheme is that the program can block on page faults. And since it was single-threaded, it really was blocked. Changing it to a multithreaded architecture would involve protecting the hash and LRU structures with locks and semaphores.

After that, I realised that I was doing double buffering: the OS itself has a perfectly good LRU disk-buffer mechanism, which is probably smarter than mine. Just open()ing or mmap()ing every single file on every request is only one system call away, and (given recent activity) just as fast as, or even faster than, the buffering layer.
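
As a rough illustration of the "just map it on every request" approach, the sketch below maps a file, views it as an array without copying, and unmaps it again, leaving all caching to the OS page cache. The function name `sum_array_file` is hypothetical, and the element type (native-endian 8-byte doubles) is an assumption; the question does not say what the arrays contain.

```python
import mmap

def sum_array_file(path):
    """Map `path`, view it as native doubles, and return the sum."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # Assumption: the file size is a multiple of 8 bytes.
            # The cast creates a typed view over the mapping, not a copy.
            with memoryview(m) as raw, raw.cast("d") as values:
                return sum(values)
```

If the file was touched recently, its pages are served from the page cache and no disk I/O is needed; otherwise the first access to each page blocks on a page fault, as noted above.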

Regarding a DBMS: using one is a clean design, but you pay an overhead of at least three system calls just to get the first block of data, and it will certainly (always) block. On the other hand, it lends itself reasonably well to a multithreaded design and relieves you of the pain of locks and buffer management.
