How safe is it to rely on hashes for file identification?
I am designing cloud storage software on top of a LAMP stack.
Files could have an internal ID, but there would be many advantages to storing them on the servers' filesystems with a hash as the filename rather than an incrementing ID.
Hashes as identifiers in the database would also have a lot of advantages if the currently centralized database is ever sharded or decentralized, or if some sort of master-master high-availability environment is set up. But I am not sure about that yet.
Clients can store files under any string (usually some sort of path and filename).
This string is guaranteed to be unique, because the first level is something like a "bucket" that users have to register, as in Amazon S3 and Google Storage.
My plan is to store each file under the hash of its client-defined path.
This way the storage server can serve the file directly without asking the database which ID it has, because it can calculate the hash, and thus the filename, on the fly.
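A minimal sketch of the plan described above, written in Python for illustration rather than in the PHP a LAMP stack would normally use. The storage root, the `storage_path` name, the Unicode normalization step, and the two-level directory fan-out are all assumptions added for the example and are not part of the original design.

```python
import hashlib
import unicodedata
from pathlib import PurePosixPath

STORAGE_ROOT = PurePosixPath("/srv/storage")  # hypothetical storage root


def storage_path(bucket: str, client_path: str) -> PurePosixPath:
    """Map (bucket, client path) to an on-disk location derived from a SHA-1 hash."""
    # Assumption: bucket names never contain "/", so joining with "/" is unambiguous.
    # Normalize Unicode so that equivalent spellings of a path hash identically.
    key = unicodedata.normalize("NFC", f"{bucket}/{client_path}")
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Assumption: fan out into two directory levels to keep directories small.
    return STORAGE_ROOT / digest[:2] / digest[2:4] / digest
```

Because the location depends only on the bucket and the client path, any storage server can recompute it for a request without a database lookup.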
But I am afraid of collisions. I am currently thinking about using SHA-1 hashes.
I have heard that Git also uses hashes as revision identifiers.
I know that the chance of a collision is really, really low, but it is not zero.
I just cannot judge this. Would you or would you not rely on a hash for this purpose?
I could also use some normalized encoding of the path instead, maybe base64 as the filename, but I really do not want that because it could get messy, paths could get too long, and there may be other complications.
Comments (2)
Assume you have a hash function with "perfect" properties, and assume cryptographic hash functions approach that ideal. The theory that applies is then the same one that applies to birthday attacks. What it says is that, given a maximum number of files, you can make the collision probability as small as you want by using a larger hash digest size. SHA-1 has 160 bits, so for any practical number of files the probability of a collision is going to be just about zero. If you look at the table in the link you'll see that a 128-bit hash with 10^10 files has a collision probability of about 10^-18.
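The birthday estimate can be checked numerically. The sketch below uses the standard approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n items and a b-bit hash, and assumes an ideal, uniformly distributed hash function; the function name is made up for this example.

```python
import math


def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_files share a hash value."""
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / 2^(bits+1)).
    exponent = -num_files * (num_files - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


p_128 = collision_probability(10**10, 128)  # 128-bit hash, 10^10 files
p_160 = collision_probability(10**10, 160)  # SHA-1-sized digest, 10^10 files
```

Under these assumptions `p_128` comes out at roughly 1.5 × 10^-19, below the 10^-18 figure quoted above, and `p_160` is smaller again by a factor of 2^32.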
As long as the probability is low enough I think the solution is good. Compare with the probability of the planet being hit by an asteroid, undetectable errors in the disk drive, bits flipping in your memory etc. - as long as those probabilities are low enough we don't worry about them because they'll "never" happen. Just take enough margin and make sure this isn't the weakest link.
One thing to be concerned about is the choice of the hash function and its possible vulnerabilities. Is there any other authentication in place, or does the user simply present a path and retrieve a file?
If you think about an attacker trying to brute force the scenario above, each request is effectively a random guess that hits one of the stored files with probability 10^10 / 2^128, so they would need on the order of 2^95 requests before they get some other random file stored in the system (again assuming a 128-bit hash and 10^10 files; you'll have a lot fewer files and a longer hash). 2^95 is an astronomically large number, and the speed at which you can brute force this is limited by the network and the server. A simple "lock the user out after x attempts" policy can completely close this hole (which is why many systems implement this sort of policy). Building a secure system is complicated and there will be many points to consider, but this sort of scheme can be perfectly secure.
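The lockout policy mentioned above could look roughly like the following. This is only an in-memory sketch: the class name, the thresholds, and the idea of counting failed lookups per client within a sliding time window are assumptions for illustration, and a real deployment would need shared state across servers.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class AttemptLimiter:
    """Lock a client out after too many failed lookups within a time window."""

    def __init__(self, max_failures: int = 5, window_seconds: float = 300.0):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._failures = defaultdict(deque)  # client id -> failure timestamps

    def _recent(self, client_id: str, now: float) -> deque:
        # Drop failures that have fallen out of the sliding window.
        failures = self._failures[client_id]
        while failures and now - failures[0] > self.window_seconds:
            failures.popleft()
        return failures

    def record_failure(self, client_id: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._recent(client_id, now).append(now)

    def is_locked_out(self, client_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        return len(self._recent(client_id, now)) >= self.max_failures
```

The server would call `record_failure` whenever a requested path does not resolve to a file, and refuse further requests while `is_locked_out` returns true.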
Hope this is useful...
EDIT: another way to think about this is that practically every encryption or authentication system relies on certain events having very low probability for its security. For example, I could get lucky and guess a prime factor of a 512-bit RSA key, but that is so unlikely that the system is considered very secure.
Whilst the probability of a collision might be vanishingly small, imagine serving a highly confidential file from one customer to their competitor just because there happens to be a hash collision.
= end of business
I'd rather use hashing for things that were less critical when collisions DO occur ;-)
If you have a database, store the files under GUIDs - so not an incrementing index, but a proper globally unique identifier. They work nicely when it comes to distributed shards / high availability etc.
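As a rough illustration of the GUID approach, the sketch below assigns each file a random version 4 UUID and records the mapping from client path to ID. The `file_index` dictionary is a hypothetical stand-in for the database table, and the function names are made up for the example.

```python
import uuid

# Hypothetical in-memory stand-in for the database table (bucket, path) -> file ID.
file_index = {}


def new_file_id() -> str:
    """Return a random (version 4) UUID to use as the storage identifier."""
    return str(uuid.uuid4())


def register_file(bucket: str, client_path: str) -> str:
    """Assign a fresh ID to a client path and remember the mapping."""
    file_id = new_file_id()
    file_index[(bucket, client_path)] = file_id
    return file_id
```

Unlike the path-hash scheme, serving a file this way needs a lookup to translate the client path into its ID, but the ID can be generated on any shard without coordination.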
Imagine the worst-case scenario and assume it will happen the week after you are featured in Wired magazine as an amazing startup ... that's a good stress test for the algorithm.