SHA-1 hash for storing files
After reading this, it sounds like a great idea to store files using the SHA-1 for the directory.
I have no idea what this means, however. All I know is that SHA-1 and MD5 are hashing algorithms. If I calculate the SHA-1 hash using this ruby script, and I then change the file's content (which changes the hash), how do I know where the file is stored?
My question is then, what are the basics of implementing a SHA-1/file-storage system?
If all of the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash?
I'm just thinking about how to create a generic file storing system like GoogleDocs, Flickr, Youtube, DropBox, etc., something that you could reuse in different environments (such as storing PubMed journal articles or Cramster homework assignments and tests, or just images like on Flickr). I'd probably store them on Amazon EC2. Just some system so I can say "this is how I'll 99% of the time do file storing from now on", so I can stop thinking about building a solid/consistent way to store files and get onto some real problems.
Comments (4)
First of all, if the contents of the files keep changing, deriving the filename from a SHA digest is not very suitable, because the name and location of the file in the filesystem must change whenever the contents of the file change.
Basically you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.
When you have a digest, for example 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you generate a file location and file name from it. For example, you split off the first few characters of the digest to form the directory structure and use the remaining characters as the file name. This way you only need to store the SHA-1 digest of the file in the database, and you can always work out the right location and name of the file from it.
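The split described above can be sketched roughly like this. The function names `digest_of` and `path_for`, and the choice of two directory levels of two characters each, are my own assumptions for illustration; the answer only says "the first few characters".

```python
import hashlib
from pathlib import PurePosixPath


def digest_of(content: bytes) -> str:
    """Return the SHA-1 digest of the content as 40 hex characters."""
    return hashlib.sha1(content).hexdigest()


def path_for(digest: str, levels: int = 2, width: int = 2) -> PurePosixPath:
    """Turn a hex digest into a relative path.

    The first `levels * width` characters become nested directory names
    (an assumed layout), and the remaining characters become the file name.
    """
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return PurePosixPath(*parts, digest[levels * width:])


print(path_for("00e4f56c0de1c61fdb926e79e8a0a65bd12930c9"))
# prints 00/e4/f56c0de1c61fdb926e79e8a0a65bd12930c9
```

Because the path is a pure function of the digest, the database row only needs the digest column and the path never has to be stored separately.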
Directories usually also have a maximum number of entries they can contain, for example a maximum of 32000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you will store too many files in the same directory. Hashing like this also ensures that every directory holds about the same number of files, so you won't get into a situation where all your files end up in the same directory.
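To get a feel for how evenly the files spread out, here is a small experiment. The workload (100,000 made-up contents of the form `file-N`) and the single two-character directory level are assumptions for the sake of the demonstration, not anything from the original answer.

```python
import hashlib
from collections import Counter

# Hypothetical workload: 100,000 distinct pieces of content.
# Each one is assigned to a bucket named after the first two hex
# characters of its SHA-1 digest, so there can be at most 16 * 16 = 256 buckets.
buckets = Counter(
    hashlib.sha1(f"file-{i}".encode()).hexdigest()[:2]
    for i in range(100_000)
)

print("buckets in use:", len(buckets))
print("smallest bucket:", min(buckets.values()))
print("largest bucket:", max(buckets.values()))
```

With one two-character level, a per-directory limit around 32000 entries would only be reached after millions of files, and adding a second level multiplies the number of buckets by another 256.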
The idea is not to change the file content, but rather its name (and path), by using a hash value.
Changing the content with a hash would be disastrous since a hash is normally not reversible.
I'm not sure of the motivation for using a hash rather than the file name (or even rather than a long random number), but here are a few advantages of the hash approach. It makes it harder to:
a) guess a file name
b) categorize pictures (should someone steal the hard drive content)
The general interest of using a hash is that unlike a file name, a hash is meaningless, and therefore one would require the database to relate images and "bibliographic" type data (name of uploader, date of upload, tags, ...)
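As a rough sketch of what such a database could look like, here is a single table that relates each upload to the digest that locates its bytes. The table name `uploads`, its columns, and the helper functions are all made up for illustration; I'm using an in-memory SQLite database only so the example runs on its own.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per upload; the digest column is what locates the bytes on disk.
conn.execute(
    """
    CREATE TABLE uploads (
        id            INTEGER PRIMARY KEY,
        digest        TEXT NOT NULL,
        original_name TEXT NOT NULL,
        uploader      TEXT NOT NULL,
        uploaded_at   TEXT NOT NULL
    )
    """
)


def record_upload(content, original_name, uploader, uploaded_at):
    """Store the metadata for one upload and return the content digest."""
    digest = hashlib.sha1(content).hexdigest()
    conn.execute(
        "INSERT INTO uploads (digest, original_name, uploader, uploaded_at) "
        "VALUES (?, ?, ?, ?)",
        (digest, original_name, uploader, uploaded_at),
    )
    return digest


def find_by_uploader(uploader):
    """Return (original_name, digest) pairs for one uploader, oldest first."""
    rows = conn.execute(
        "SELECT original_name, digest FROM uploads "
        "WHERE uploader = ? ORDER BY id",
        (uploader,),
    )
    return rows.fetchall()
```

The meaningful name lives only in the database, so the files on disk reveal nothing about who uploaded them or what they were called.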
In thinking about it, re-reading the referenced SO response, I don't really see much of an advantage of a hash, as compared to, say, a random number...
Furthermore... some hashes produce a numeric value, typically expressed in hexadecimal (as seen in the referenced SO question), and this could be seen as wasteful: it makes the file names longer than they need to be, and hence puts more stress on the file system (bigger directories...)
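To put numbers on the length concern, here is a quick comparison of three ways to write the same 20-byte SHA-1 digest as a file name. The sample content is arbitrary, and the alternative encodings are my own suggestion rather than something from the referenced question.

```python
import base64
import hashlib

raw = hashlib.sha1(b"example content").digest()  # 20 raw bytes

hex_name = raw.hex()                                               # 4 bits per character
b32_name = base64.b32encode(raw).decode("ascii").lower()           # 5 bits per character
b64_name = base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")  # 6 bits per character

print(len(hex_name), len(b32_name), len(b64_name))  # prints 40 32 27
```

Note that the base64 form mixes upper and lower case, so two different digests could clash on a case-insensitive file system; base32 does not have that problem.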
One advantage I see with storing files using their hash is that the file data only needs to be stored once and can then be referenced multiple times within your database. This will save you space if you have different users uploading the exact same file.
However, the downside is that when a user deletes what they think is their file from your app, you can't just physically delete the file from disk, because other users who uploaded the exact same file may still be using it.
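One way to handle the delete problem is to track who references each digest and only remove the bytes when the last reference goes away. This is just a minimal in-memory sketch under that assumption; `DedupStore` and its methods are hypothetical names, and in a real system the two dictionaries would be files on disk and rows in a database.

```python
import hashlib


class DedupStore:
    """In-memory stand-in for a blob directory plus a reference table."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes (would be files on disk)
        self.refs = {}   # digest -> set of (user, name) references

    def put(self, user, name, content):
        """Store content once per digest and record who references it."""
        digest = hashlib.sha1(content).hexdigest()
        self.blobs.setdefault(digest, content)
        self.refs.setdefault(digest, set()).add((user, name))
        return digest

    def delete(self, user, name, digest):
        """Drop one reference; return True only if the bytes were removed."""
        owners = self.refs.get(digest)
        if owners is None:
            return False
        owners.discard((user, name))
        if owners:
            return False  # someone else still uses this content
        del self.refs[digest]
        del self.blobs[digest]
        return True


store = DedupStore()
d = store.put("alice", "holiday.jpg", b"same bytes")
store.put("bob", "beach.jpg", b"same bytes")
print(len(store.blobs))                          # prints 1
print(store.delete("alice", "holiday.jpg", d))   # prints False
print(store.delete("bob", "beach.jpg", d))       # prints True
```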
The idea is that you need to come up with a name for the photo, and you probably want to scatter the files among a number of directories. One easy way to come up with a unique name is to use the hash.
So the beginning of the hash was peeled off for a multi-level directory structure and the rest of the hash was used for a filename for the jpg.
This has the additional benefit of detecting duplicate uploads.
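Here is a small sketch of how the duplicate check falls out of the naming scheme. The function name `store_upload`, the two-level layout, and the `.jpg` suffix are assumptions for illustration, and the example writes into a temporary directory so it can be run safely.

```python
import hashlib
import os
import tempfile


def store_upload(root, content):
    """Write content under root at a digest-derived path.

    Returns (path, is_duplicate). If a file already exists at the path,
    the same content was uploaded before and nothing is written.
    """
    digest = hashlib.sha1(content).hexdigest()
    directory = os.path.join(root, digest[:2], digest[2:4])
    path = os.path.join(directory, digest[4:] + ".jpg")
    if os.path.exists(path):
        return path, True
    os.makedirs(directory, exist_ok=True)
    with open(path, "wb") as handle:
        handle.write(content)
    return path, False


with tempfile.TemporaryDirectory() as root:
    first = store_upload(root, b"fake jpeg bytes")
    second = store_upload(root, b"fake jpeg bytes")
    print(first[1], second[1])  # prints False True
```

This treats equal digests as equal content. Practical SHA-1 collisions have been demonstrated, so if deliberately crafted uploads are a concern, `hashlib.sha256` can be swapped in with the same structure.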