What hash should I use, and how, for files when building a deduplication app for OS X?

Posted 12-18 19:02 · 620 words · 5 views


I am about to embark on a programming journey, which will undoubtedly end in failure and/or my throwing the mouse through my Mac, but it's an interesting problem.

I want to build an app that starts at some base directory, recursively walks down through every file, and, whenever it finds an exact duplicate file, deletes it and makes a symbolic link in its place. Basically a poor man's deduplication. This actually solves a real problem for me, since I have a bunch of duplicate files on my Mac and need to free up disk space.

From what I have read, this is the strategy:

  1. Loop through recursively and generate a hash for each file. The hash needs to be effectively unique. This is the first problem: what hash should I use, and how do I run the entire binary contents of each file through it?

  2. Store each file's hash and full path in a key/value store. I'm thinking Redis is an excellent fit because of its speed.

  3. Iterate through the key/value store, find duplicate hashes, delete the duplicate file, create the symbolic link, and flag the row in the key/value store as a copy.

My questions therefore are:

  • What hashing algorithm should I use for each file? How is this done?
  • I'm thinking about using Node.js because Node is generally fast at I/O-type things. The problem is that Node sucks at CPU-intensive stuff, so the hashing will probably be the bottleneck.
  • What other gotchas am I missing here?


Comments (2)

情域 · 2024-12-25 19:02:50


What hashing algorithm should I use for each file? How is this done?

Use SHA1. Git uses SHA1 to generate a unique hash for each file, and an accidental collision is practically impossible. (Deliberate SHA1 collisions have since been demonstrated, but that matters for security, not for deduplicating your own files.)

I'm thinking about using node.js because node generally is fast at i/o types of things. The problem is that node sucks at CPU intensive stuff, so the hashing will probably be the bottleneck.

Your application will have two kinds of operation:

  • Reading file (IO bound).
  • Calculating hash (CPU bound).

My suggestion is: don't calculate the hash in a scripting language (Ruby or JavaScript) unless it has a native hashing library. You can just invoke another executable such as sha1sum. It's written in C and should be blazingly fast.

I don't think you need NodeJS. NodeJS is fast at event-driven I/O, but it cannot raise your raw I/O speed, and I don't think you need to implement event-driven I/O here.

What other gotchas am I missing here?

My suggestion: just implement it in a language you are familiar with. Don't over-engineer too early. Optimize only when you actually hit a performance problem.

恋竹姑娘 · 2024-12-25 19:02:50


A little late but I used miaout's advice and came up with this...

var exec = require('child_process').exec;

// `openssl sha1 <file>` prints "SHA1(<file>)= <40 hex chars>";
// the regex captures the hex digest after the "= ".
exec('openssl sha1 "' + file + '"', { maxBuffer: 200 * 10240 }, function (err, stdout, stderr) {
  var match = /=\s?(\w+)/.exec(stdout);
  fileInfo.hash = match ? match[1] : null;
  next();
});

You could use sha1sum, but like every other great piece of software it requires something like Homebrew to install. Of course, you could also compile it yourself if you have the build environment for it.
