Quickly accessing a directory with 500,000 files

Posted 2024-07-08 14:40:31


I have a directory with 500,000 files in it. I would like to access them as quickly as possible. The algorithm requires me to repeatedly open and close them (I can't have 500,000 files open simultaneously).

How can I do that efficiently? I had originally thought that I could cache the inodes and open the files that way, but *nix doesn't provide a way to open files by inode (security or some such).

The other option is to just not worry about it and hope the FS does a good job of file lookup in a directory. If that is the best option, which FSes would work best? Do certain filename patterns look up faster than others? e.g. 01234.txt vs. foo.txt

BTW this is all on Linux.
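For reference, a minimal C sketch of this access pattern, assuming the files can be processed one at a time; the directory path and the processing step are placeholders. Keeping the directory open and using openat() at least avoids re-resolving the directory path on every open, though the per-name lookup inside the directory still depends on the filesystem:

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Open the big directory once and reuse its fd for every open. */
        DIR *dir = opendir("/data/bigdir");   /* placeholder path */
        if (!dir) { perror("opendir"); return 1; }
        int dfd = dirfd(dir);

        struct dirent *ent;
        while ((ent = readdir(dir)) != NULL) {
            if (ent->d_name[0] == '.')
                continue;                     /* skip "." and ".." */

            /* openat() resolves only the final name inside the already-
               open directory instead of walking the whole path again. */
            int fd = openat(dfd, ent->d_name, O_RDONLY);
            if (fd < 0) { perror(ent->d_name); continue; }

            /* ... process the file here ... */

            close(fd);   /* can't keep 500,000 fds open at once */
        }
        closedir(dir);
        return 0;
    }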


5 Answers

三生池水覆流年 2024-07-15 14:40:32


Assuming your file system is ext3, your directory is indexed with a hashed B-tree if dir_index is enabled. That's going to give you as much of a boost as anything you could code into your app.

If the directory is indexed, your file naming scheme shouldn't matter.

http://lonesysadmin.net/2007/08/17/use-dir_index-for-your-new-ext3-filesystems/
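As a side note beyond this answer: dir_index can be turned on with tune2fs -O dir_index, and e2fsck -D will rebuild the indexes of existing directories. If you want to check from code whether a particular directory is hash-indexed, one option is the FS_IOC_GETFLAGS ioctl (the same per-inode flags lsattr shows), whose FS_INDEX_FL bit marks a hashed-B-tree directory. A hedged C sketch:

    #include <fcntl.h>
    #include <linux/fs.h>      /* FS_IOC_GETFLAGS, FS_INDEX_FL */
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        const char *path = argc > 1 ? argv[1] : ".";
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror(path); return 1; }

        int flags = 0;         /* kernel treats the argument as an int */
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
            perror("FS_IOC_GETFLAGS");
            close(fd);
            return 1;
        }
        printf("%s is %shash-indexed\n", path,
               (flags & FS_INDEX_FL) ? "" : "NOT ");
        close(fd);
        return 0;
    }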

遇到 2024-07-15 14:40:32


A couple of ideas:

a) If you can control the directory layout then put the files into subdirectories.

b) If you can't move the files around, then you might try different filesystems; I think xfs might be good for directories with lots of entries.

沩ん囻菔务 2024-07-15 14:40:32


If you've got enough memory, you can use ulimit to increase the maximum number of files that your process can have open at one time; I have successfully done this with 100,000 files, and 500,000 should work as well.
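For completeness, the programmatic counterpart of ulimit -n is setrlimit(); a minimal sketch that raises the soft limit as far as the hard limit allows (pushing the hard limit itself higher, or reaching 500,000, may also require root and a larger system-wide fs.file-max):

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) < 0) { perror("getrlimit"); return 1; }
        printf("soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);

        /* Raise the soft limit up to the hard limit; going beyond the
           hard limit needs root (CAP_SYS_RESOURCE). */
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_NOFILE, &rl) < 0) { perror("setrlimit"); return 1; }
        return 0;
    }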

If that isn't an option for you, try to make sure that your dentry cache has enough room to store all the entries. The dentry cache is the filename -> inode mapping that the kernel uses to speed up file access based on filename; accessing huge numbers of different files can effectively eliminate the benefit of the dentry cache, as well as introduce an additional performance hit. A stock 2.6 kernel's hash holds up to 256 entries per MB of RAM at a time, so if you have 2GB of memory you should be okay for up to a little over 500,000 files.

Of course, make sure you perform the appropriate profiling to determine if this really causes a bottleneck.

浅忆 2024-07-15 14:40:32


The traditional way to do this is with hashed subdirectories. Assume your file names are all uniformly-distributed hashes, encoded in hexadecimal. You can then create 256 directories based on the first two characters of the file name (so, for instance, the file 012345678 would be named 01/2345678). You can use two or even more levels if one is not enough.

As long as the file names are uniformly distributed, this will keep the directory sizes manageable, and thus make any operations on them faster.
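A small C sketch of the mapping this describes, assuming hex-encoded names at least three characters long (shard_path is a made-up helper name):

    #include <stdio.h>
    #include <string.h>

    /* Map a hex-encoded file name like "012345678" to the sharded
       path "01/2345678". Returns 0 on success, -1 if the name is too
       short or the buffer too small. */
    static int shard_path(const char *name, char *out, size_t outsz)
    {
        size_t len = strlen(name);
        if (len < 3 || outsz < len + 2)
            return -1;
        snprintf(out, outsz, "%.2s/%s", name, name + 2);
        return 0;
    }

    int main(void)
    {
        char path[64];
        if (shard_path("012345678", path, sizeof path) == 0)
            printf("%s\n", path);   /* prints: 01/2345678 */
        return 0;
    }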

聽兲甴掵 2024-07-15 14:40:32


Another question is how much data is in the files? Is an SQL back end an option?
