Traversing files on a distributed filesystem
I have a filesystem with a few hundred million files (several petabytes) and I want to get pretty much everything that stat would return and store it in some sort of database. Right now, we have an MPI program that is fed directory names from a central queue, and worker nodes that slam NFS (which can handle this without trying too hard) with stat calls. The worker nodes then hit postgres to store the results.
Although this works, it's very slow. A single run takes over 24 hours on a modern 30-node cluster.
Does anyone have any ideas for splitting up the directory structure instead of using a centralized queue (I'm under the impression that exact algorithms for this are NP-hard)? Also, I've been considering replacing postgres with something like MongoDB's autosharding with several routers (since postgres is currently a huge bottleneck).
I'm pretty much just looking for ideas in general on how this setup could be improved.
Unfortunately, using something like the 2.6 kernel audit subsystem is probably out of the question, since it would be extremely difficult (politically) to get it running on every machine that hits this filesystem.
If it matters, every machine (several thousand) using this filesystem runs Linux 2.6.x.
The primary purpose of this is to find files older than a certain date so we can delete them. We also want to collect general data on how the filesystem is being used.
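For concreteness, the per-directory work each node does is roughly the following (a minimal sketch; `enqueue_dir` and `store_record` are hypothetical stand-ins for the central queue and the postgres writer):

```python
import os
import stat

def process_directory(dirpath, enqueue_dir, store_record):
    """Stat every entry in one directory: subdirectories go back
    to the central queue, metadata goes to the database writer."""
    for name in os.listdir(dirpath):
        path = os.path.join(dirpath, name)
        try:
            st = os.lstat(path)  # lstat: don't follow symlinks
        except OSError:
            continue  # entry vanished between listdir and stat
        store_record(path, st)  # st_size, st_mtime, st_uid, ...
        if stat.S_ISDIR(st.st_mode):
            enqueue_dir(path)  # feed the subdirectory back to the queue
```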
Expanding on my comments.
Having the files in a central location is one of the biggest bottlenecks. If you can't optimize the filesystem access times in other ways, probably the best approach is to have one (or a couple) of workers doing the stat calls. You will not get performance improvements by adding more than a couple of workers, because they are all accessing the same filesystem. Because of this, I think that putting the workers on the node where the filesystem is located (instead of accessing it through NFS) should give you a great performance boost.
On the other hand, the database writes can be optimized by changing your db engine. As mentioned in the comments, the Redis key-value model is better suited for such a task (yes, it is pretty fast): you can use its hash type to store the result of the stat call, using the full pathname as the key. Additionally, Redis will also support clustering in the near future.
We ended up creating our own solution for this (using Redis). We brought the run time down from about 24 hours to about 2.5 hours.
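The post doesn't say how the Redis-based replacement was structured, but one plausible arrangement (purely a sketch, not the authors' actual code) is to use a Redis list as the shared work queue in place of the central MPI queue:

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed shared instance
QUEUE = "dirs-to-scan"  # hypothetical queue key

# Seed the queue once with the filesystem root.
r.rpush(QUEUE, "/export/data")  # hypothetical mount point

# Worker loop: block until a directory name is available.
while True:
    item = r.blpop(QUEUE, timeout=30)
    if item is None:
        break  # queue stayed empty; assume the walk is done
    dirpath = item[1].decode()
    # ... stat the entries here, and r.rpush(QUEUE, subdir)
    # for each subdirectory found ...
```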