Is SHA good enough for checking file duplicates? (sha1_file in PHP)

Posted 2024-08-22 07:16:12


Suppose you wanted to make a file hosting site for people to upload their files and send a link to their friends to retrieve them later, and you want to ensure files are not duplicated where we store them. Is PHP's sha1_file good enough for the task? Is there any reason not to use md5_file instead?

For the frontend, it'll be obscured, with the original file name stored in a database, but an additional concern is whether this would reveal anything about the original poster. Does a file inherit any meta information with it, like last-modified or who posted it, or is that stuff kept in the file system?

Also, is using a salt frivolous, since security against rainbow table attacks means nothing here and the hash could later be used as a checksum?

One last thing: scalability? Initially it's only going to be used for small files a couple of megs big, but eventually...

Edit 1: The point of the hash is primarily to avoid file duplication, not to create obscurity.
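For reference, here is a minimal sketch (mine, not from the question) of the deduplicating store being described. The names store_deduplicated, $uploadedPath and $storageDir are hypothetical; the idea is content-addressed storage, where the hash doubles as the storage key:

function store_deduplicated(string $uploadedPath, string $storageDir): string
{
    // Content-addressed storage: identical uploads hash to the same
    // digest and therefore map to the same blob on disk.
    $digest = sha1_file($uploadedPath);   // content hash = storage key
    $target = $storageDir . '/' . $digest;

    if (!file_exists($target)) {          // keep only the first copy
        move_uploaded_file($uploadedPath, $target);
    }

    // The caller would record $digest next to the original file name
    // in the database and serve download links through it.
    return $digest;
}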


Comments (4)

不打扰别人 2024-08-29 07:16:12


sha1_file good enough?

Using sha1_file is mostly enough; there is a very small chance of collision, but it will almost never happen. To reduce the chance to almost zero, compare file sizes too:

function is_duplicate_file($file1, $file2)
{
    // Cheap check first: files of different sizes cannot be duplicates.
    if (filesize($file1) !== filesize($file2)) {
        return false;
    }

    // Strict comparison (===) avoids PHP's loose numeric comparison,
    // which treats hex strings like "0e123..." == "0e456..." as equal.
    return sha1_file($file1) === sha1_file($file2);
}

md5 is faster than sha1, but it produces a shorter digest and so less unique output; the chance of a collision when using md5 is still very small, though.

Scalability?

There are several methods for comparing files; which one to use depends on what your performance concerns are. I ran a small test of the different methods:

1- Direct file compare:

if( file_get_contents($file1) != file_get_contents($file2) )

2- sha1_file

if( sha1_file($file1) != sha1_file($file2) )

3- md5_file

if( md5_file($file1) != md5_file($file2) )

The results:
Two 1.2 MB files were compared 100 times each; I got the following results:

--------------------------------------------------------
 method                  time (s)      peak memory (bytes)
--------------------------------------------------------
file_get_contents          0.5              2,721,576
sha1_file                  1.86               142,960
md5_file                   1.6                142,848

file_get_contents was the fastest, about 3.7× faster than sha1_file, but it is not memory efficient.

sha1_file and md5_file are memory efficient: they used about 5% of the memory that file_get_contents used.

md5_file might be a better option because it is a little faster than sha1_file.

So the conclusion: it depends on whether you want a faster comparison or lower memory usage.
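For anyone who wants to reproduce the numbers, here is a rough sketch of the kind of harness used above (my reconstruction, not the original test; the file paths are placeholders). Note that memory_get_peak_usage() is process-wide, so to isolate per-method peaks each method would really need its own run:

// Placeholder paths for two local ~1.2 MB test files.
$file1 = '/path/to/file1.bin';
$file2 = '/path/to/file2.bin';

$methods = [
    'file_get_contents' => fn($a, $b) => file_get_contents($a) === file_get_contents($b),
    'sha1_file'         => fn($a, $b) => sha1_file($a) === sha1_file($b),
    'md5_file'          => fn($a, $b) => md5_file($a) === md5_file($b),
];

foreach ($methods as $name => $compare) {
    $start = microtime(true);
    for ($i = 0; $i < 100; $i++) {
        $compare($file1, $file2);       // compare the pair 100 times
    }
    printf("%-18s %.2f s (peak memory so far: %d bytes)\n",
        $name, microtime(true) - $start, memory_get_peak_usage());
}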

执手闯天涯 2024-08-29 07:16:12


As per my comment on @ykaganovich's answer, SHA1 is (surprisingly) slightly faster than MD5.

From your description of the problem, you are not trying to create a secure hash, merely to hide the file in a large namespace; in that case the use of a salt / rainbow tables is irrelevant. The only consideration is the likelihood of a false collision (where two different files give the same hash). The probability of this happening with md5 is very, very remote, and even more remote with sha1. However, you do need to think about what happens when two independent users upload the same warez to your site. Who owns the file?

In fact, there doesn't seem to be any reason at all to use a hash - just generate a sufficiently long random value.
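A one-line sketch of that alternative: random_bytes() gives a cryptographically secure identifier, so the namespace is unguessable without any hashing. The trade-off (my note, not the answer's) is that, unlike a content hash, a random name cannot detect duplicates:

// 16 random bytes -> a 32-character hex name for the download link.
$fileId = bin2hex(random_bytes(16));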

眼泪淡了忧伤 2024-08-29 07:16:12


SHA should do just fine in any "normal" environment, although here is what Ben Lynn, the author of "Git Magic", has to say:

A.1. SHA1 Weaknesses
As time passes, cryptographers discover more and more SHA1 weaknesses. Already, finding hash collisions is feasible for well-funded organizations. Within a few years, perhaps even a typical PC will have enough computing power to silently corrupt a Git repository. Hopefully Git will migrate to a better hash function before further research destroys SHA1.

You can always use SHA256, or other even longer digests. Finding an MD5 collision is easier than finding a SHA1 collision.
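In PHP that is a one-line change, since the generic hash_file() streams the file just like sha1_file() does ($path below is a placeholder for the stored file):

$path   = '/path/to/stored/file';       // placeholder
$digest = hash_file('sha256', $path);   // same streaming behaviour, longer digest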

小清晰的声音 2024-08-29 07:16:12


Both should be fine. sha1 is a safer hash function than md5, which also means it's slower, which probably means you should use md5 :). You still want to use a salt to prevent plaintext/rainbow attacks in the case of very small files (don't make assumptions about what people decide to upload to your site). The performance difference will be negligible. You can still use the hash as a checksum as long as you know the salt.
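A small sketch of what a salted file hash could look like (the incremental-hashing approach is my illustration, not from the answer): feed the salt in first, then stream the file through, and the result is still a deterministic checksum for anyone who knows the salt.

function salted_sha1(string $path, string $salt): string
{
    $ctx = hash_init('sha1');
    hash_update($ctx, $salt);                      // salt goes in first
    hash_update_stream($ctx, fopen($path, 'rb'));  // then the file content
    return hash_final($ctx);                       // hex digest
}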

With respect to scalability, I'd guess that you're likely going to be IO-bound, not CPU-bound, so I don't think calculating the checksum would add much overhead, especially if you do it on the stream as it's being uploaded.
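A sketch of that streaming approach, assuming hypothetical stream handles ($in for the incoming request body, $out for the destination file): PHP's incremental hashing API computes the digest while writing, so the upload is read only once.

// Hash the upload while writing it to disk, one chunk at a time.
$in  = fopen('php://input', 'rb');           // incoming request body
$out = fopen('/tmp/upload.partial', 'wb');   // hypothetical destination

$ctx = hash_init('sha1');
while (!feof($in)) {
    $chunk = fread($in, 8192);
    hash_update($ctx, $chunk);   // feed the hash incrementally
    fwrite($out, $chunk);        // write the same bytes to disk
}
$digest = hash_final($ctx);      // hex digest, same as sha1_file()

fclose($in);
fclose($out);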
