Which hash algorithm can be used for duplicate content verification?

Posted 2024-12-17 16:11:28 · 152 characters · 1 view · 0 comments

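I have an XML file and I need to determine whether it is a duplicate.

I will either hash the entire XML file, or generate some kind of hash from specific XML nodes within the file.

Is MD5 suitable for this?

Or something else? The speed of generating the hash is fairly important, but a guarantee of producing a unique hash for unique data is more important.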

離殇 2024-12-24 16:11:29

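MD5 is broken (in the sense that a hash collision can be generated deliberately). If you are concerned about someone maliciously creating a file with the same hash as another file, you should probably use a member of the SHA-2 family (e.g. SHA-256).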
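As a rough illustration of the SHA-256 suggestion, here is a minimal Python sketch that hashes a file in chunks using the standard `hashlib` module. The function name `file_digest`, its parameters, and the 1 MiB chunk size are assumptions made for this example, not part of the original answer.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of the file at `path`, reading it in chunks.

    `algorithm` is any name accepted by hashlib.new(); the chunk size is an
    arbitrary choice that keeps memory use flat for large files.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Two files with the same bytes produce the same digest, so comparing `file_digest(a)` with `file_digest(b)` is one way to test for byte-level duplicates.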

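Note that a hash function inherently cannot guarantee a unique hash value for every possible input. Hash functions have a finite output length (e.g. MD5 is 128 bits long, so there are 2¹²⁸ possible hash values). You cannot map a potentially infinite domain onto a finite codomain without collisions; that is mathematically impossible.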

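However, by the birthday paradox, a good hash function is expected to produce its first accidental collision only after roughly 2^(n/2) inputs have been hashed, where n is the output length in bits (for 128-bit MD5, roughly 2⁶⁴ inputs). For any realistic number of files the chance is statistically negligible, so you do not have to worry about a collision happening by accident.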
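To make the birthday estimate concrete, here is the usual approximation, assuming the hash behaves like a uniformly random function. The symbol k (the number of distinct inputs hashed) is introduced here for illustration and does not appear in the answer.

$$
P(\text{collision}) \;=\; 1 - \prod_{i=0}^{k-1}\left(1 - \frac{i}{2^{n}}\right) \;\approx\; 1 - e^{-k(k-1)/2^{n+1}} \;\approx\; \frac{k^{2}}{2^{n+1}} \qquad (k \ll 2^{n/2})
$$

For example, with n = 128 and k = 10⁹ files, this gives about 10¹⁸ / 2¹²⁹ ≈ 1.5 × 10⁻²¹. The probability only approaches one half when k is on the order of 2^(n/2) = 2⁶⁴.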

饮湿 2024-12-24 16:11:29

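MD5 is suitable and fast. Note, though, that a difference of a single character produces a completely different MD5.

MD5 can produce the same hash for different inputs, although this is very rare. So, depending on your input (do you expect many similar XML files, or many different ones?), when MD5 gives you a positive match you can additionally compare the plain string contents.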
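The "hash first, then compare the contents on a match" idea can be sketched as follows. This is only an illustration: the class name `DuplicateIndex` and its `add` method are hypothetical, and it keeps every document in memory, which a real system would likely replace with stored paths or database keys.

```python
import hashlib
from collections import defaultdict


class DuplicateIndex:
    """Hypothetical in-memory index: digest lookup first, byte comparison second."""

    def __init__(self) -> None:
        # Maps an MD5 digest to every distinct content seen with that digest.
        self._by_digest = defaultdict(list)

    def add(self, content: bytes) -> bool:
        """Store `content`; return True if identical bytes were already stored."""
        digest = hashlib.md5(content).digest()
        bucket = self._by_digest[digest]
        # A digest match is only a candidate; the byte comparison confirms it,
        # so a hash collision can never be reported as a duplicate.
        if any(existing == content for existing in bucket):
            return True
        bucket.append(content)
        return False
```

With this design the hash only narrows down the candidates, so its collision resistance affects performance rather than correctness.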

橘味果▽酱 2024-12-24 16:11:29

If someone can alter at least partially the contents of some of the XML files, and that someone has an advantage in making you declare two XML files (or XML excerpts) identical while in fact they are not, then you need a cryptographically secure hash function, namely one which is resistant to collisions. A collision is a pair of distinct messages (sequences of bytes) which yield the same hash output -- exactly what you would like to avoid. Since a hash function accepts inputs longer than its output, collisions necessarily exist; a hash function is deemed cryptographically secure when nobody can actually produce such a collision.

If a hash function outputs n bits, then one can expect to find a collision after hashing about 2^(n/2) distinct messages. A secure hash function is one for which no method is known to get a collision faster than that.

If there is no security issue (i.e. nobody will actively try to find a collision, you just fear a collision out of bad luck), then cryptographically weak hash functions are an option, provided that they have a large enough output, so that 2^(n/2) remains far bigger than the expected number of XML files you will compare. For n = 128 (i.e. 2^(n/2) = 2^64, about eighteen billion billion), MD5 is fine, fast and widely supported. You may want to investigate MD4, which is even weaker, but a bit faster too. If you want a larger n, try SHA-1, which offers 160-bit outputs (SHA-1 was long considered much less "cryptographically broken" than MD5, but a practical SHA-1 collision was published in 2017, so treat it the same way: acceptable against bad luck, not against an attacker).

If you have, even potentially, security issues, then go for SHA-256. No cryptographic weakness with regard to collisions is currently known for that function. If you run into performance issues (which is rather improbable: on a basic PC, SHA-256 can process more than 100 megabytes of data per second, so chances are that XML parsing will be much more expensive than hashing), consider SHA-512, which is somewhat faster on platforms which offer 64-bit integer types (but considerably slower on platforms which do not).
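Throughput figures like the one above depend heavily on the machine and the library build, so it may be worth measuring locally. The following is a rough Python sketch using `hashlib` and `time.perf_counter`; the helper name `throughput_mb_per_s` and the 16 MB buffer are assumptions for this example, and a single timed run is only a coarse estimate. MD4 is left out because some OpenSSL builds no longer expose it.

```python
import hashlib
import time


def throughput_mb_per_s(algorithm: str, size_mb: int = 16) -> float:
    """Hash `size_mb` megabytes of zero bytes once and return MB per second."""
    data = b"\0" * (size_mb * 1024 * 1024)
    start = time.perf_counter()
    hashlib.new(algorithm, data).digest()
    elapsed = time.perf_counter() - start
    return size_mb / elapsed


if __name__ == "__main__":
    for name in ("md5", "sha1", "sha256", "sha512"):
        print(f"{name:>7}: {throughput_mb_per_s(name):8.1f} MB/s")
```

The numbers will vary from run to run; repeating the measurement and taking the best value gives a steadier comparison.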

Note that all these hash functions are about sequences of bytes. A single flipped bit changes the output. In the XML world, a given document can be encoded in various ways which are semantically identical, but distinct as far as bits on the wire are concerned (e.g. é and &#233; both represent the same character é). It is up to you to define which notion of equality you want to use; see canonical XML.
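One possible way to apply the canonical XML idea in Python is `xml.etree.ElementTree.canonicalize` (available since Python 3.8, implementing C14N 2.0), hashing the canonical form instead of the raw bytes. This is a sketch under that assumption; the function name `canonical_xml_digest` and the sample documents are made up for illustration.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # requires Python 3.8+


def canonical_xml_digest(xml_text: str) -> str:
    """Return the SHA-256 hex digest of the C14N 2.0 form of `xml_text`."""
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two encodings of the same document: different attribute order, and the
# character é written literally in one and as a numeric reference in the other.
doc_a = '<note lang="fr" id="1">caf\u00e9</note>'
doc_b = '<note id="1" lang="fr">caf&#233;</note>'

raw_equal = (
    hashlib.sha256(doc_a.encode("utf-8")).digest()
    == hashlib.sha256(doc_b.encode("utf-8")).digest()
)
canonical_equal = canonical_xml_digest(doc_a) == canonical_xml_digest(doc_b)
```

Here the raw byte hashes differ while the canonical digests match. Whether insignificant whitespace, comments, or namespace prefixes should also be normalized is a separate decision about which notion of equality you want.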
