Checking the physical existence of files referenced in a database table
We have one rather large table containing document info together with file paths pointing to files on the file system.
After a couple of years we noticed that there are files on disk which are not referenced in the DB table, and vice versa.
Since I'm currently learning Clojure, I thought it would be nice to make a small utility which can find the diff between the DB and the file system. Naturally, since I'm a beginner, I got stuck: there are more than 600,000 documents, and obviously I need a more performant and less memory-consuming solution :)
My first idea was to generate a flattened list of all files in the filesystem tree and compare it with the list from the DB: if a referenced file doesn't exist, put it in a separate "non-existing" list, and if a file exists on the HDD but not in the DB, move it to some dump directory.
Any ideas?
Comments (2)
As a sketch, here's how you could check the filesystem against the database, in chunks of whatever size you're happy with:
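One way such a chunked walk might look, written in Java since the thread targets the JVM. This is only an illustration under assumptions: the class name `ChunkedWalk`, the method `forEachChunk`, and the chunk size of 1000 are all made up for the example. It walks the tree lazily and hands over one batch of paths at a time, so only the current batch is held in memory.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Stream;

class ChunkedWalk {

    // Walks every regular file under root and passes the paths to `handler`
    // in batches of at most chunkSize. Names and sizes here are assumptions.
    static void forEachChunk(Path root, int chunkSize, Consumer<List<Path>> handler)
            throws IOException {
        List<Path> chunk = new ArrayList<>(chunkSize);
        // Files.walk is lazy, so the whole tree is never materialized at once.
        try (Stream<Path> paths = Files.walk(root)) {
            Iterator<Path> it = paths.filter(Files::isRegularFile).iterator();
            while (it.hasNext()) {
                chunk.add(it.next());
                if (chunk.size() == chunkSize) {
                    handler.accept(chunk);
                    chunk = new ArrayList<>(chunkSize);
                }
            }
        }
        // Hand over the last, possibly smaller, batch.
        if (!chunk.isEmpty()) {
            handler.accept(chunk);
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        // Placeholder action: print a summary of each chunk.
        forEachChunk(root, 1000, chunk ->
                System.out.printf("chunk of %d files, first: %s%n",
                        chunk.size(), chunk.get(0)));
    }
}
```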
Instead of using printf, do some kind of database check against the list of files.

I would suggest one of three options, depending on your preference for performance vs. memory:
1. Memory-intensive: use a recursive method calling File.listFiles to put all the files into a list, then compare the list against your DB.
2. IO-intensive: recursively check each file, one at a time, against the DB.
3. Intermediate: read all the files in one directory and compare them against the DB, then recurse on any sub-directories and repeat. This has the same number of IO calls as option 1 but only holds one branch plus one directory's worth of file paths in memory at any one time.
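The intermediate option could be sketched roughly as follows. Everything beyond `File.listFiles` is an assumption made for illustration: the class `DirByDirCheck`, the method `findUnreferenced`, and above all the `lookup` function, which stands in for a real database query that takes the paths of one directory and returns those the table knows about.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Function;

class DirByDirCheck {

    // Checks one directory at a time, then recurses into its sub-directories.
    // `lookup` is a hypothetical stand-in for the DB query: given the file
    // paths of one directory, it returns the subset referenced in the table.
    // Every path missing from that subset is reported to `onOrphan`.
    static void findUnreferenced(File dir,
                                 Function<List<String>, Set<String>> lookup,
                                 Consumer<String> onOrphan) {
        File[] entries = dir.listFiles();
        if (entries == null) {
            return; // not a directory, or it could not be read
        }
        List<String> files = new ArrayList<>();
        List<File> subDirs = new ArrayList<>();
        for (File entry : entries) {
            if (entry.isDirectory()) {
                subDirs.add(entry);
            } else {
                files.add(entry.getPath());
            }
        }
        if (!files.isEmpty()) {
            Set<String> known = lookup.apply(files); // one lookup per directory
            for (String path : files) {
                if (!known.contains(path)) {
                    onOrphan.accept(path);
                }
            }
        }
        for (File sub : subDirs) {
            findUnreferenced(sub, lookup, onOrphan);
        }
    }

    public static void main(String[] args) {
        // In-memory stand-in for the table's file-path column (assumption).
        Set<String> dbPaths = new HashSet<>(Arrays.asList("./docs/a.pdf"));
        File root = new File(args.length > 0 ? args[0] : ".");
        findUnreferenced(root, paths -> {
            Set<String> known = new HashSet<>(paths);
            known.retainAll(dbPaths);
            return known;
        }, path -> System.out.printf("not in DB: %s%n", path));
    }
}
```

This covers only files on disk that the table does not reference. The opposite direction, rows whose file is missing, can be checked separately by reading the path column in batches and testing each path with `File.exists`.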