Hash code vs. SHA-1

Published 2024-07-20 12:45:23


I'd like to compare some large objects representing trees, and to cache something so that I don't have to compare each new object against the ones that already exist.

The question is: what would be the best "something"? (A compromise between performance and collisions.)

On the one hand, I have a regular hashCode function based on the values of the various fields (following Chapter 3 of Effective Java). But I'm not able to evaluate the potential collisions entailed by such an approach.

On the other hand, I have the MessageDigest approach from the standard Java distribution with the SHA-1 algorithm. I presume it's not going to be as efficient, but I may get fewer collisions. Am I right? Is that a correct solution in my context, or am I completely wrong?

The thing is that I don't know what the size of the objects will be. Please also note that the computed value is not going to be used in a hash table.

Thanks...
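A minimal sketch of the two options being weighed — a fields-based hashCode in the Effective Java style, and a SHA-1 fingerprint via MessageDigest. The TreeNode class and its fields are hypothetical, purely for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Objects;

// Hypothetical tree node, used only to contrast the two fingerprinting options.
class TreeNode {
    final String label;
    final int weight;

    TreeNode(String label, int weight) {
        this.label = label;
        this.weight = weight;
    }

    // Option 1: a regular fields-based hashCode (Effective Java style).
    @Override
    public int hashCode() {
        return Objects.hash(label, weight);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TreeNode)) return false;
        TreeNode t = (TreeNode) o;
        return weight == t.weight && Objects.equals(label, t.label);
    }

    // Option 2: a SHA-1 fingerprint over a canonical byte encoding of the fields.
    byte[] sha1Fingerprint() {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            md.update(label.getBytes(StandardCharsets.UTF_8));
            md.update((byte) (weight >>> 24));
            md.update((byte) (weight >>> 16));
            md.update((byte) (weight >>> 8));
            md.update((byte) weight);
            return md.digest(); // 20 bytes = 160 bits
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-1 is required on every JRE", e);
        }
    }
}
```

Note the trade-off this makes concrete: the hashCode is a cheap 32-bit value, while the SHA-1 fingerprint costs a digest computation per object but yields 160 bits.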


白云不回头 2024-07-27 12:45:23


Keep in mind the following:

  • Objects may be unequal, yet have the same hash code.
  • Your collision potential depends on how many objects you encounter.
  • How useful hash codes are depends on how you implement your checking.

Generally, you can determine the chance of a collision based on the number of expected objects and the number of possible hashes (the maximum hash value). See http://en.wikipedia.org/wiki/Birthday_paradox for the detailed explanation.

Personally? Java objects (instantiated classes), fewer than 10,000 of them? Hash code. Representing files / blobs / lots of data? SHA-1. I use SHA-1 hashing in my database to keep people from doing ETL work on the same file more than once. I then use SHA-1 hashing again at a second level to keep people from ETLing the same section in more than one file (e.g., different files, but the same order shows up twice).
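The birthday-paradox estimate this answer refers to can be sketched numerically (a rough approximation, not part of the original answer): with n items and m possible hash values, the collision probability is roughly p ≈ 1 − e^(−n²/2m).

```java
// Approximate birthday-bound collision probability: p ≈ 1 - exp(-n^2 / (2m)),
// where n is the number of items and m is the number of possible hash values.
public class BirthdayBound {
    static double collisionProbability(double n, double m) {
        return 1.0 - Math.exp(-(n * n) / (2.0 * m));
    }

    public static void main(String[] args) {
        double hash32 = Math.pow(2, 32);   // 32-bit hashCode space
        double hash160 = Math.pow(2, 160); // SHA-1 space
        System.out.printf("10k objects, 32-bit hash:  %.4f%n",
                collisionProbability(10_000, hash32));
        System.out.printf("10k objects, 160-bit hash: %.2e%n",
                collisionProbability(10_000, hash160));
    }
}
```

For 10,000 objects the 32-bit estimate comes out around 1%, while the 160-bit estimate is vanishingly small, which matches the rule of thumb above.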

临风闻羌笛 2024-07-27 12:45:23


Personally, I would use hashCode() for the objects until it's proven that any possible collisions are an actual problem, to avoid preemptively optimizing a problem you might not actually have.

千秋岁 2024-07-27 12:45:23


Because of the birthday problem, the chance of a collision depends on how many items you are working with.

The 160-bit space of SHA-1 is so large that I doubt you could ever have enough items to see a collision.

The 32-bit space of hashCode() should not have a significant number of collisions until you have over 50,000 items. However, this depends on using a good hash algorithm.

In order to apply a cryptographic digest like SHA-1, you'll need to convert your graph to a string of bytes, which is likely to be computationally expensive, and could be complicated.
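One way to get that byte string — a sketch only, with the caveats it carries — is to pipe Java serialization into the digest. This assumes the graph is Serializable; note that Java serialization does not guarantee that logically-equal graphs produce identical bytes, which this fingerprint requires:

```java
import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.security.MessageDigest;

// Sketch: serialize a Serializable object graph to bytes, then SHA-1 the bytes.
// Caveat: equal-but-distinct graphs must serialize identically for this to work
// as a fingerprint, which Java serialization does not promise in general.
public class GraphDigest {
    static byte[] sha1Of(Serializable graph) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(graph);
            }
            return MessageDigest.getInstance("SHA-1").digest(bytes.toByteArray());
        } catch (Exception e) {
            throw new RuntimeException("serialization or digest failed", e);
        }
    }
}
```

A hand-written canonical encoding of the tree (visiting nodes in a fixed order) avoids the serialization caveat, at the cost of more code.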

如此安好 2024-07-27 12:45:23


Usually for duplicate file/data detection, MD5 is a good tradeoff between speed and chance of collision. MD5 is inappropriate if somebody could be deliberately crafting files to fool your program (it is slightly vulnerable to collision attacks). But if you're just worried about collisions by chance, then its 128-bit width is practically always sufficient at present.

SHA-1 and SHA-256 give you some protection against deliberate collision attacks (theoretical but no practical attacks with SHA-1 are known; for keying data, it's rarely worth going beyond a 160-bit hash code width). SHA-1 is roughly half the speed of MD5.

Certainly if you use MD5, performance probably shouldn't be too much of an issue. But obviously this does depend on the size of your data. You may be interested in some information I put together about performance of secure hash functions in Java.

If you really do need something faster and you're only dealing with a few million items of data, then another option to consider is the 64-bit hash algorithm proposed by the Numerical Recipes authors.

Java's standard hashCode() implementation (of, say, String) is probably not suitable: aside from any issues about the quality of the hash, its 32-bit width means that you'll expect a collision after just 16,000 items or so.
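The duplicate-detection scheme this answer describes can be sketched as follows — the class and method names are illustrative, not from the answer: hash each item's bytes with MD5 and use the hex digest as a dedup key.

```java
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

// Sketch of duplicate detection keyed on an MD5 digest of the content.
public class DedupByDigest {
    private final Set<String> seen = new HashSet<>();

    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (Exception e) {
            throw new AssertionError("MD5 is required on every JRE", e);
        }
    }

    /** Returns true the first time this content is seen, false for duplicates. */
    boolean firstTimeSeen(byte[] content) {
        return seen.add(md5Hex(content));
    }
}
```

As the answer notes, this only defends against accidental duplicates; an adversary crafting colliding inputs defeats MD5.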

总攻大人 2024-07-27 12:45:23


I'll endorse matt b's saying "don't optimize before you need to optimize."

However, should you decide you need something more than the hash code down the road... I used message digests (MD5 in my case) to "uniquely" identify various items downloaded from RSS feeds so I didn't end up with the same item appearing many times in the list as I polled over and over. Those were typically small postings so the digest could be calculated quickly. In my experience it was very effective and worked well.

Since they are normally one-way functions meant to react strongly to even very small changes in the input data, you are definitely less likely to get collisions with MD5 or SHA-1.
