Caching a tar inside the JVM for faster file I/O?

Published 2024-07-09 21:18:02

I'm working on a Java web application that uses thousands of small files to build artifacts in response to requests. I think our system could see performance improvements if we could map these files into memory rather than run all over the disk to find them all the time.

I have heard of mmap in linux, and my basic understanding of that concept is that when a file is read from disk the file's contents get cached somewhere in memory for quicker subsequent access. What I have in mind is similar to that idea, except I'd like to read the whole mmap-able set of files into memory as my web app is initializing for minimal request-time responses.

One aspect of my thought-train here is that we'd probably get the files into JVM memory faster if they were all tarred up and somehow mounted in the JVM as a virtual file system. As it stands, it can take several minutes for our current implementation to walk through the set of source files and just figure out what all is on the disk. This is because we're essentially doing file stats for upwards of 300,000 files.

I have found the Apache VFS project, which can read information from a tar file, but I'm not sure from their documentation whether you can specify something such as "also, read the entire tar into memory and hold it there...".

We're talking about a multithreaded environment here serving artifacts that usually piece together about 100 different files out of a complete set of 300,000+ source files to make one response. So whatever the virtual file system solution is, it needs to be thread safe and performant. We're only talking about reading files here, no writes.

Also, we're running a 64 bit OS with 32 gig of RAM, our 300,000 files take up about 1.5 to 2.5 gigs of space. We can surely read a 2.5 gigabyte file into memory much quicker than 300K small several-kilobyte-sized files.

Thanks for your input!

  • Jason

Comments (8)

南城追梦 2024-07-16 21:18:02

You can try to put all the files in a JAR and put that on the classpath. Java uses some built-in tricks to make reading from a JAR file very fast. That will also keep the directory of all files in RAM, so you don't have to access the disk to find a file (that happens before you can start loading it).

The JVM won't load the whole JAR into RAM at once, and you probably don't want that anyway because your machine would start swapping. But it will be able to find the pieces very quickly because it keeps the file open the whole time, and therefore you won't lose any time opening/closing the file resource.

Also, since you're using this single file all the time, chances are that the OS will keep it longer in the file caches.

Lastly, you can try to compress the JAR. While this may sound like a bad idea, you should give it a try. If the small files compress very well, the time to unpack them with current CPUs is much lower than the time to read the data from the disk. If you don't have to keep the intermediate data anywhere, you can stream the uncompressed data to the client without needing to write it to a file (which would ruin the whole idea). The drawback is that it does eat CPU cycles, and if your CPU is busy (just check with some load tool; if it's above 20%, then you lose), you will make the whole process slower.

That said, when you're using the HTTP protocol, you can tell the client that you're sending compressed data! This way, you don't have to unpack the data on the server at all, and you only have to load the very small compressed files.

Main disadvantage of the JAR solution: You can't replace the JAR as long as the server is running. So replacing a file means you will have to restart the server.
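
A minimal sketch of the classpath approach, assuming the files are packed into a JAR on the web app's classpath (the resource path below is made up); the classloader keeps the JAR open and its entry directory indexed, so a lookup needs no per-file stat:

    import java.io.IOException;
    import java.io.InputStream;

    public class ClasspathArtifactReader {
        // Loads one packed file by its path inside the JAR. The JAR itself
        // never has to be opened by hand: the classloader already holds it
        // open and keeps its entry directory in memory.
        public static byte[] read(String pathInsideJar) throws IOException {
            try (InputStream in = ClasspathArtifactReader.class
                    .getResourceAsStream("/" + pathInsideJar)) {
                if (in == null) {
                    throw new IOException("not on classpath: " + pathInsideJar);
                }
                return in.readAllBytes(); // Java 9+
            }
        }
    }

A call such as ClasspathArtifactReader.read("templates/foo.xml") then resolves through the classloader's in-memory index instead of a directory walk on disk.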

仲春光 2024-07-16 21:18:02

If you have 300,000 files that you need to access quickly, you could use a database; not a relational one, but a simple key-value one, like http://www.space4j.org/. This won't help your startup time, but it could be quite a speed-up at runtime.

有深☉意 2024-07-16 21:18:02

Just to clarify, mmap() in Unix-like systems would not allow you to access files as such; it simply makes the contents of a file available in memory, as memory. You cannot use open() to further open any contained files. There is no such thing as a "mmap()able set of files".

Can't you just add a pass that loads all your "templates" initially, and then quickly finds them based on something simple, like a hash on the name of each? That should let you leverage your memory, and get down to O(1) access for any template.
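
For what it's worth, Java exposes mmap through FileChannel.map; a minimal sketch (the tar path is hypothetical) illustrating that what you get back is a buffer of bytes, not a mounted file system, so any files inside the tar would still have to be parsed out of those bytes:

    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MmapExample {
        public static void main(String[] args) throws Exception {
            Path tar = Paths.get("artifacts.tar"); // hypothetical file
            try (FileChannel ch = FileChannel.open(tar, StandardOpenOption.READ)) {
                // The kernel maps the file into virtual memory; pages are
                // faulted in lazily as the buffer is read. A single map()
                // region is capped at 2 GB in Java, so a 2.5 GB tar would
                // need to be split across several regions.
                long len = Math.min(ch.size(), Integer.MAX_VALUE);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, len);
                // buf is just bytes -- there is no open() for files "inside" it.
                System.out.printf("mapped %d bytes, first byte 0x%02x%n",
                        buf.capacity(), buf.get(0));
            }
        }
    }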

秋千易 2024-07-16 21:18:02

I think you're still thinking in the old memory/disk mode.

mmap won't help here, because that old memory/disk split is long gone. If you mmap a file, the kernel gives you back a pointer to some virtual memory to use at your own discretion; it does not load the file into real memory at once. It does that only when you ask for a part of the file, and it loads only the page(s) you're requesting. (That is, memory pages, usually around 4 KB each.)

You say those 300k files take about 1.5 GB to 2.5 GB of disk space. If there's any chance you can throw 2 (or better, 4) more gigabytes of RAM into your server, you would be much better off leaving the disk reading to the OS: if it has enough RAM to hold the files in its disk cache, it will, and from then on any read() on them won't even hit the disk. (It will still touch the disk to store atime in the inode, if you've not mounted your volume with noatime.)

If you try to read() the files, get them into memory, and serve them from there, you have no way to know for sure that they'll always be in RAM and not in swap, because the OS may have other things to do with that part of memory while you haven't used it for a while.

If you have enough RAM to let the OS do disk caching, and you really want the files to get loaded, you can always run a little script/program that walks your hierarchy and reads all the files (without doing anything else). That gets the OS to load them from disk into the in-memory disk cache, but you have no way of knowing they'll stay there if the OS needs the memory. Hence what I said before: you should let the OS deal with that and give it enough RAM to do so.

You should read Varnish's Architect Notes, where phk tells you in his own words why what you're trying to achieve is much better left to the OS, which will always know better than the JVM what's in RAM and what is not.
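
A minimal sketch of that warm-up pass, assuming the sources live under /data/artifacts (a made-up path); it reads every file once so the OS page cache is hot, with no guarantee the pages stay resident afterwards:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class CacheWarmer {
        public static void main(String[] args) throws IOException {
            Path srcRoot = Paths.get("/data/artifacts"); // hypothetical source root
            long[] totals = {0, 0}; // {files, bytes}
            try (Stream<Path> paths = Files.walk(srcRoot)) {
                paths.filter(Files::isRegularFile).forEach(p -> {
                    try {
                        // Reading the bytes pulls the file through the page cache.
                        totals[1] += Files.readAllBytes(p).length;
                        totals[0]++;
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
            System.out.printf("warmed %d files, %d bytes%n", totals[0], totals[1]);
        }
    }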

初见终念 2024-07-16 21:18:02

If you need fast access to all those files, you could load them into memory, but I would not load them as files. I would put that data in some kind of object structure (in the simplest form, just a String).

What I would do is create a service that returns the file as an object structure for whatever parameter you're using, then implement some caching mechanism around that service. Then it's all a matter of tuning the cache. If you really need to load everything into memory, configure your cache to use more memory. If some files are used much more than others, it might be sufficient to cache just those...

We could probably give you a better response if we knew more about what you are trying to achieve.
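
A minimal sketch of such a service, with the root directory and String-per-file representation assumed for illustration; ConcurrentHashMap.computeIfAbsent gives a thread-safe cache in which each file is loaded at most once, and the eviction policy is the part left to tune:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class TemplateService {
        private final Path root;
        private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();

        public TemplateService(Path root) {
            this.root = root;
        }

        // Thread-safe: computeIfAbsent loads each template at most once per
        // key; every later request is served straight from memory.
        public String get(String name) {
            return cache.computeIfAbsent(name, this::loadFromDisk);
        }

        private String loadFromDisk(String name) {
            try {
                return Files.readString(root.resolve(name)); // Java 11+
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }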

清风挽心 2024-07-16 21:18:02

Put the files on 10 different servers, and instead of directly serving the requests, send the client HTTP redirects (or an equivalent) with the URL where they can find the file they want. This allows the load to be spread: the server just responds to quick requests, and the (large) downloads are spread over several machines.
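
A minimal sketch of the redirect scheme (host names and the sharding rule are made up; it assumes the javax.servlet API is available in the web app):

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class RedirectServlet extends HttpServlet {
        private static final String[] HOSTS = {
            "http://files1.example.com", "http://files2.example.com" // hypothetical
        };

        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String file = req.getPathInfo(); // e.g. "/templates/foo.xml"; null check omitted
            // Shard deterministically by name so the same file always goes to
            // the same backend, keeping that machine's cache warm for it.
            String host = HOSTS[Math.floorMod(file.hashCode(), HOSTS.length)];
            resp.sendRedirect(host + file); // 302 to the file server
        }
    }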

挖鼻大婶 2024-07-16 21:18:02

If you are on Linux, I would give the good old RAM disk a try. You can stick with the current way of doing things and just drastically reduce IO costs. You are not bound by the JVM's memory and can still easily replace the content.

As you were talking about VFS: that also has a RAM disk provider but I would still try the native RAM disk approach first.
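
A minimal sketch of priming such a RAM disk at startup, assuming a tmpfs is already mounted at /mnt/ramdisk (mount point and source path are made up); the existing file-serving code can then simply point at the RAM-backed copy:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    import java.util.stream.Stream;

    public class RamDiskPrimer {
        public static void main(String[] args) throws IOException {
            Path src = Paths.get("/data/artifacts");        // hypothetical source tree
            Path dst = Paths.get("/mnt/ramdisk/artifacts"); // tmpfs-backed target
            try (Stream<Path> paths = Files.walk(src)) {
                paths.forEach(p -> {
                    try {
                        Path target = dst.resolve(src.relativize(p).toString());
                        if (Files.isDirectory(p)) {
                            Files.createDirectories(target);
                        } else {
                            // Copy into RAM once; later reads never hit the disk.
                            Files.copy(p, target, StandardCopyOption.REPLACE_EXISTING);
                        }
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
    }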

宣告ˉ结束 2024-07-16 21:18:02

What you need is to load all the information into a hash table.

Load every file using its name as the key and the contents as the value, and you'll be able to work orders of magnitude faster and more easily than with the setup you have in mind.
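
A minimal sketch along those lines (the source root is hypothetical); a ConcurrentHashMap, rather than the legacy synchronized Hashtable, keeps lookups thread-safe for the multithreaded serving path:

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.stream.Stream;

    public class InMemoryFiles {
        // Walks the tree once at startup: relative path -> file contents.
        public static Map<String, byte[]> loadAll(Path srcRoot) throws IOException {
            Map<String, byte[]> files = new ConcurrentHashMap<>();
            try (Stream<Path> paths = Files.walk(srcRoot)) {
                paths.filter(Files::isRegularFile).forEach(p -> {
                    try {
                        files.put(srcRoot.relativize(p).toString(), Files.readAllBytes(p));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
            return files; // ~2.5 GB of values should fit in a 32 GB machine's heap budget
        }
    }

Every lookup afterwards is a plain files.get("templates/foo.xml") with no disk access at all.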
