Which hash algorithm can be used for duplicate content verification?

Posted 2024-12-17 16:11:28

I have an xml file, where I need to determine if it is a duplicate or not.

I will either hash the entire xml file, or specific xml nodes in the xml file will be used to then generate some kind of hash.

Is md5 suitable for this?

Or something else? Speed of hash generation is fairly important, but a guarantee that unique data produces a unique hash is more important.

Comments (3)

離殇 2024-12-24 16:11:29

MD5 is broken, in the sense that it's possible to intentionally generate a hash collision. If you are concerned about someone maliciously creating a file with the same hash as another file, you should probably use a member of the SHA-2 family (e.g. SHA-256).


Note that hash functions, by their nature, cannot guarantee a unique hash for every possible input. Hash functions have a fixed output length (e.g. MD5 is 128 bits long, so there are 2^128 possible hashes). You can't map a potentially infinite domain injectively onto a finite co-domain; that is mathematically impossible.

However, by the birthday paradox, a good n-bit hash function is only expected to produce a collision after roughly 2^(n/2) distinct inputs have been hashed (e.g. about 2^64 for 128-bit MD5). For any realistic number of files the probability is so small that you don't have to worry about a collision happening by accident.
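To put a number on "statistically insignificant", here is a rough sketch using the standard birthday approximation p ≈ 1 - exp(-k(k-1) / 2^(n+1)) for k random n-bit hashes. The function name `collision_probability` is made up for illustration, and the estimate assumes the hash behaves like a uniformly random function (no attacker involved).

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items random hashes collide."""
    # Birthday approximation: p ~= 1 - exp(-k(k-1) / 2^(n+1)).
    exponent = -num_items * (num_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # One billion documents hashed with a 128-bit function such as MD5.
    print(collision_probability(10**9, 128))
    # Around 2^64 documents the chance of a collision becomes substantial.
    print(collision_probability(2**64, 128))
```

With a billion files and a 128-bit hash, the estimate stays far below anything worth worrying about; it only becomes appreciable once the number of files approaches 2^64.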

饮湿 2024-12-24 16:11:29

MD5 is suitable and fast. Note though that a single difference in one character will produce a completely different MD5.

There is a slight chance that MD5 will produce the same hash for different inputs, though this will be very rare. So, depending on your input (are you expecting many similar XMLs or many different ones?), when MD5 gives you a positive match you can compare the actual contents to confirm it.
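A minimal sketch of that "hash first, then confirm" idea, assuming the documents are already in memory as bytes. The helper name `find_duplicates` and the dict-of-bytes input are assumptions made for illustration, not something from the question.

```python
import hashlib


def find_duplicates(documents: dict[str, bytes]) -> list[tuple[str, str]]:
    """Return (earlier_name, later_name) pairs whose contents are byte-identical."""
    seen: dict[str, list[str]] = {}  # MD5 hex digest -> names with that digest
    duplicates = []
    for name, data in documents.items():
        digest = hashlib.md5(data).hexdigest()
        for earlier in seen.get(digest, []):
            # Same digest: confirm with a full comparison to rule out a collision.
            if documents[earlier] == data:
                duplicates.append((earlier, name))
                break
        else:
            # No confirmed match, so remember this document under its digest.
            seen.setdefault(digest, []).append(name)
    return duplicates


if __name__ == "__main__":
    docs = {"a.xml": b"<r>1</r>", "b.xml": b"<r>2</r>", "c.xml": b"<r>1</r>"}
    print(find_duplicates(docs))
```

The full comparison only runs when two digests already match, so it costs almost nothing when most documents differ.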

橘味果▽酱 2024-12-24 16:11:29

If someone can alter, at least partially, the contents of some of the XML files, and that someone stands to gain by making you declare two XML files (or XML excerpts) identical when in fact they are not, then you need a cryptographically secure hash function, namely one which is resistant to collisions. A collision is a pair of distinct messages (sequences of bytes) which yield the same hash output, which is exactly what you would like to avoid. Since a hash function accepts inputs longer than its output, collisions necessarily exist; a hash function is deemed cryptographically secure when nobody can actually produce such a collision.

If a hash function outputs n bits, then one can expect to find a collision after hashing about 2^(n/2) distinct messages. A secure hash function is a hash function such that no method is known to get a collision faster than that.

If there is no security issue (i.e. nobody will actively try to find a collision, you just fear a collision out of bad luck), then cryptographically weak hash functions are an option, provided that they have a large enough output, so that 2^(n/2) remains far bigger than the expected number of XML files you will compare. For n = 128 (i.e. 2^(n/2) = 2^64, about 1.8 × 10^19), MD5 is fine, fast and widely supported. You may want to investigate MD4, which is even weaker, but a bit faster too. If you want a larger n, try SHA-1, which offers 160-bit outputs (practical SHA-1 collisions have since been demonstrated, so like MD5 it belongs only in this non-adversarial category).

If you have security issues, even potential ones, then go for SHA-256. No cryptographic weakness with regard to collisions is currently known for that function. If you run into performance issues (which is rather improbable: on a basic PC, SHA-256 can process more than 100 megabytes of data per second, so chances are that XML parsing will be far more expensive than hashing), consider SHA-512, which is somewhat faster on platforms which offer 64-bit integer types (but considerably slower on platforms which do not).
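For reference, a minimal sketch of hashing a file with SHA-256 using Python's standard `hashlib`, reading in chunks so large files never have to fit in memory. The function name `sha256_of_file` and the 1 MiB chunk size are arbitrary choices for illustration.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of the file at path, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Swapping `hashlib.sha256()` for `hashlib.sha512()` is the only change needed to try the SHA-512 variant mentioned above.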

Note that all these hash functions operate on sequences of bytes. A single flipped bit changes the output. In the XML world, a given document can be encoded in various ways which are semantically identical, but distinct as far as bits on the wire are concerned (e.g. a literal é and the character reference &#233; both represent the same character é). It is up to you to define which notion of equality you want to use; see canonical XML.
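One possible way to apply this, sketched with Python's standard library (`xml.etree.ElementTree.canonicalize`, available from Python 3.8, implements C14N 2.0): canonicalize the document first, then hash the canonical form. The helper name `canonical_sha256` is made up for illustration, and whether C14N matches your notion of equality (it does not ignore insignificant whitespace by default, for instance) is something you would need to check.

```python
import hashlib
from xml.etree.ElementTree import canonicalize


def canonical_sha256(xml_text: str) -> str:
    """Hash the canonical (C14N 2.0) form of an XML document."""
    # Canonicalization normalizes attribute order, quoting and character references.
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    first = '<doc b="2" a="1">\u00e9</doc>'
    second = "<doc a='1' b='2'>&#233;</doc>"
    # Different bytes on the wire, same canonical form, hence the same hash.
    print(canonical_sha256(first) == canonical_sha256(second))
```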
