Using SHA-1 hashes to store files

Posted 2024-08-12 10:10:15

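After reading this, it sounds like storing files under directories named after their SHA-1 is a good idea.

I have no idea what that means in practice, though; all I know is that SHA-1 and MD5 are hashing algorithms. If I calculate the SHA-1 hash using this ruby script and then change the contents of the file (which changes the hash), how do I know where the file is stored?

So my question is: what are the basics of implementing a SHA-1-based file storage system?

If all the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash?

I'm just thinking about how to create a general-purpose file storage system, like GoogleDocs, Flickr, Youtube, DropBox, etc., that you could reuse in different environments (for example storing PubMed journal articles, or Cramster homework assignments and tests, or just images as on Flickr). I'd probably store the files on Amazon EC2. I just want some system so I can say "from now on, this is how I'll do file storage 99% of the time", so I can stop thinking about building a reliable, consistent way to store files and get on with solving some real problems.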

少女的英雄梦 2024-08-19 10:10:15

First of all, if the contents of the files keep changing, deriving filenames from a SHA digest is not a very good fit, because the name and location of a file in the filesystem must change whenever its contents change.

Basically, you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.

Once you have a digest, for example 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you generate the file location and filename from it. Typically you use the first few characters of the digest as the directory structure and the rest of the characters as the file name. For example:

 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9 => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt

This way you only need to store the SHA-1 digest of the file in the database. From the digest you can always work out the correct location and name of the file.
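The digest-to-path mapping described above can be sketched in a few lines of Python. The function name `digest_to_path` and its parameters (`root`, `levels`, `width`, `ext`) are hypothetical names chosen for this example; the defaults simply reproduce the `some/path/00/e4/f5/...txt` layout shown earlier.

```python
import posixpath


def digest_to_path(digest: str, root: str = "some/path",
                   levels: int = 3, width: int = 2, ext: str = ".txt") -> str:
    """Map a hex digest to a sharded storage path.

    The first levels * width characters become nested directory names;
    the remaining characters plus the extension become the file name.
    """
    digest = digest.lower()
    prefix_len = levels * width
    if len(digest) <= prefix_len:
        raise ValueError("digest is too short for the requested layout")
    dirs = [digest[i:i + width] for i in range(0, prefix_len, width)]
    # posixpath keeps forward slashes regardless of the host OS.
    return posixpath.join(root, *dirs, digest[prefix_len:] + ext)


print(digest_to_path("00e4f56c0de1c61fdb926e79e8a0a65bd12930c9"))
```

Because the path is a pure function of the digest, the database row only needs the digest (plus whatever metadata you want, such as the original file name); the location is recomputed on every lookup.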

Directories usually also have a maximum number of entries they can contain, for example a maximum of 32000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you will put too many files into the same directory. Because digests are spread roughly evenly, every directory also ends up with about the same number of files, so you won't get into a situation where all your files sit in one directory.
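To get a feel for how evenly the files spread out, one can count how many digests land in each top-level directory. This is only an illustrative sketch: `bucket_counts` is a hypothetical helper, and the synthetic "file contents" strings stand in for real files.

```python
import hashlib
from collections import Counter


def bucket_counts(n_files: int, width: int = 2) -> Counter:
    """Count how many of n_files synthetic contents fall into each
    top-level directory, named by the first `width` hex characters."""
    counts = Counter()
    for i in range(n_files):
        digest = hashlib.sha1(f"file contents {i}".encode()).hexdigest()
        counts[digest[:width]] += 1
    return counts


counts = bucket_counts(20_000)
# With two hex characters there are at most 16**2 = 256 top-level directories.
print("directories used:", len(counts))
print("fewest files in one directory:", min(counts.values()))
print("most files in one directory:", max(counts.values()))
```

With two hex characters per level and three levels, the layout shown earlier has 16**6 = 16,777,216 possible leaf directories, so a per-directory limit such as 32000 entries is unlikely to be reached by accident.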
