Are cryptographic hashes injective under certain conditions?

Posted 2024-12-11 08:26:41


Sorry for the lengthy post. I have a question about common cryptographic hashing algorithms, such as the SHA family, MD5, etc.

In general, such a hash algorithm cannot be injective, since the digest produced usually has a fixed length (e.g., 160 bits under SHA-1), whereas the space of possible messages to be digested is virtually infinite.

However, if we generate a digest of a message which is at most as long as the digest itself, what are the properties of the commonly used hashing algorithms? Are they likely to be injective on this limited message space? Are there algorithms which are known to produce collisions even on messages whose bit length is shorter than that of the digest produced?

I am actually looking for an algorithm with this property, i.e., one which, at least in principle, may generate colliding hashes even for short input messages.

Background: We have a browser plug-in which, for every website visited, makes a server request asking whether the website belongs to one of our known partners. But of course, we don't want to spy on our users. So, in order to make it hard to reconstruct some kind of surf history, we do not actually send the URL visited, but a hash digest (currently SHA-1) of some cleaned-up version. On the server side, we have a table of hashes of well-known URIs, which is matched against the received hash. We can live with a certain amount of uncertainty here, since we consider not being able to track our users a feature, not a bug.

For obvious reasons, this scheme is pretty fuzzy and admits false positives as well as false negatives (URIs that should have matched but did not).

So right now, we are considering changing the fingerprint generation to something with more structure. For example, instead of hashing the full (cleaned-up) URI, we might instead:

  1. split the host name into components at "." and hash those individually
  2. split the path into components at "/" and hash those individually

Join the resulting hashes into a fingerprint value. Example: hashing "www.apple.com/de/shop" using this scheme (and using Adler 32 as hash) might yield "46989670.104268307.41353536/19857610/73204162".
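
A minimal sketch of this scheme in Python, using zlib.adler32 as the per-component hash (the helper name "fingerprint" and the exact splitting rules are assumptions of the sketch, not something specified above):

    import zlib

    def fingerprint(uri: str) -> str:
        # Hash each host label and each path segment separately with Adler-32,
        # then join the per-component digests into a structured fingerprint.
        host, _, path = uri.partition("/")
        host_part = ".".join(str(zlib.adler32(label.encode())) for label in host.split("."))
        path_part = "/".join(str(zlib.adler32(seg.encode())) for seg in path.split("/") if seg)
        return host_part + ("/" + path_part if path_part else "")

    print(fingerprint("www.apple.com/de/shop"))
    # The host components match the example above: adler32(b"www") == 46989670, etc.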

However, as such a fingerprint has a lot of structure (in particular when compared to a plain SHA-1 digest), we might accidentally make it pretty easy again to compute the actual URI visited by a user (for example, by using a pre-computed table of hash values for "common" component values, such as "www").
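
This attack is cheap to mount. For illustration (the word list here is an assumption), a small dictionary mapping Adler-32 values of common component strings back to their plaintexts inverts most fingerprint components instantly:

    import zlib

    common = ["www", "com", "net", "org", "de", "en", "shop", "blog"]
    table = {zlib.adler32(w.encode()): w for w in common}

    # Any fingerprint component can now be looked up directly:
    print(table.get(46989670))  # -> "www"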

So right now, I am looking for a hash/digest algorithm which has a high rate of collisions even on short messages (Adler-32 is being seriously considered), so that the probability of a given component hash being unique is low. We hope that the additional structure we impose provides us with enough additional information so as to improve the matching behaviour (i.e., lower the rate of false positives/false negatives).


Comments (3)

花想c 2024-12-18 08:26:41


I do not believe hashes are guaranteed to be injective for messages the same size as the digest. If they were, they would be bijective, which would be missing the point of a hash. This suggests that they are not injective for messages smaller than the digest either.

If you want to encourage collisions, I suggest you use any hash function you like, then throw away bits until it collides enough.

For example, throwing away 159 bits of a SHA-1 hash will give you a pretty high collision rate. You might not want to throw that many away.
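
A hedged sketch of the truncation idea in Python; the number of bits kept is a tuning knob, not a value this answer prescribes:

    import hashlib

    def truncated_sha1(data: bytes, bits: int = 16) -> int:
        # Full SHA-1, then keep only the low `bits` bits to deliberately
        # raise the collision rate.
        digest = int.from_bytes(hashlib.sha1(data).digest(), "big")
        return digest & ((1 << bits) - 1)

    # With 16 bits kept, two random inputs collide with probability 2**-16, and
    # by the birthday bound a few hundred components give roughly even odds of
    # at least one internal collision.
    print(truncated_sha1(b"www"), truncated_sha1(b"shop"))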

However, what you are trying to achieve seems inherently dubious. You want to be able to tell that a URL is one of yours, but not which one it is. That means you want your URLs to collide with each other, but not with URLs that are not yours. A hash function won't do that for you. Rather, because collisions are random, and since there are many more URLs which are not yours than which are (I assume!), any given level of collision will create dramatically more confusion over whether a URL is one of yours at all than over which of yours it is.
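
To put rough numbers on that (the counts here are assumptions for illustration): with hashes truncated to b bits, a random non-partner URL still matches some whitelist entry with probability about N/2^b:

    # Illustrative numbers only: N whitelisted URLs, hashes truncated to b bits.
    N, b = 1_000, 16
    print(N / 2**b)  # ~0.015: about 1.5% of all other URLs now look like partners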

Instead, how about sending the list of URLs to the plugin at startup, and then having it just send back a single bit indicating if it's visiting a URL in the list? If you don't want to send the URLs explicitly, send hashes (without trying to maximise collisions). If you want to save space, send a Bloom filter.

各自安好 2024-12-18 08:26:41


Since you're willing to accept a rate of false positives (that is, random sites identified as whitelisted when in fact they are not), a Bloom filter might be just the thing.

Each client downloads a Bloom filter containing the whole whitelist. Then the client has no need to otherwise communicate with the server, and there is no risk of spying.
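
A minimal Bloom filter sketch in Python (the double-hashing construction and the parameter names are choices of this sketch, not something the answer mandates):

    import hashlib, math

    class BloomFilter:
        def __init__(self, n_items: int, bits_per_item: int):
            self.m = n_items * bits_per_item                      # total bits
            self.k = max(1, round(bits_per_item * math.log(2)))   # optimal hash count
            self.bits = bytearray(-(-self.m // 8))                # ceil(m/8) zero bytes

        def _positions(self, item: bytes):
            # Derive k indices from a single SHA-1 digest via double hashing.
            d = hashlib.sha1(item).digest()
            h1 = int.from_bytes(d[:8], "big")
            h2 = int.from_bytes(d[8:16], "big") | 1
            return ((h1 + i * h2) % self.m for i in range(self.k))

        def add(self, item: bytes):
            for p in self._positions(item):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, item: bytes) -> bool:
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

    bf = BloomFilter(n_items=1_000_000, bits_per_item=32)  # 4 bytes per URL
    bf.add(b"www.apple.com/de/shop")
    print(b"www.apple.com/de/shop" in bf)  # True
    print(b"www.example.org" in bf)        # almost certainly False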

At 2 bytes per URL, the false positive rate would be below 0.1%, and at 4 bytes per URL below 1 in 4 million.
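
Those rates follow from the textbook approximation for an optimally tuned filter: with m/n bits per item, the false positive rate is roughly 0.6185^(m/n). A quick check:

    for bits_per_url in (16, 32):       # 2 and 4 bytes per URL
        print(bits_per_url, 0.6185 ** bits_per_url)
    # 16 -> ~4.6e-04 (below 0.1%)
    # 32 -> ~2.1e-07 (below 1 in 4 million)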

Downloading the whole filter (and perhaps regular updates to it) is a large investment of bandwidth up front. But supposing that it has a million URLs on it (which seems quite a lot to me, given that you can probably apply some rules to canonicalize URLs before lookup), it's a 4MB download. Compare this with a list of a million 32-bit hashes: same size, but the false positive rate would be somewhere around 1 in 4 thousand, so the Bloom filter wins for compactness.

I don't know how the plugin contacts the server, but I doubt that you can get an HTTP transaction done in much under 1kB -- perhaps less with keep-alive connections. If filter updates are less frequent than one per 4k URL visits by a given user (or a smaller number if there are fewer than a million URLs, or a false positive probability greater than 1 in 4 million), this has a chance of using less bandwidth than the current scheme, and of course it leaks much less information about the user.

It doesn't work quite as well if you require new URLs to be whitelisted immediately, although I suppose the client could still hit the server at every page request, to check whether the filter has changed and if so download an update patch.

Even if the Bloom filter is too big to download in full (perhaps for cases where the client has no persistent storage and RAM is limited), then I think you could still introduce some information-hiding by having the client compute which bits of the Bloom filter it needs to see, and asking for those from the server. With a combination of caching in the client (the higher the proportion of the filter you have cached, the fewer bits you need to ask for and hence the less you tell the server), asking for a window around the actual bit you care about (so you don't tell the server exactly which bit you need), and the client asking for spurious bits it doesn't really need (hide the information in noise), I expect you could obfuscate what URLs you're visiting. It would take some analysis to prove how much that actually works, though: a spy would be aiming to find a pattern in your requests that's correlated with browsing a particular site.
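
A sketch of what such a request scheme might look like (window size, decoy count, and the request format are all assumptions; the answer only outlines the idea):

    import random

    def obfuscated_request(needed_bit: int, filter_bits: int,
                           window: int = 4096, decoys: int = 3):
        # Ask for a window of filter bits that contains the bit we need, placed
        # at a random offset within the window, plus a few decoy windows, so
        # the server cannot tell which bit actually mattered.
        start = max(0, min(needed_bit - random.randrange(window),
                           filter_bits - window))
        ranges = [(start, window)]
        for _ in range(decoys):
            ranges.append((random.randrange(filter_bits - window), window))
        random.shuffle(ranges)
        return ranges  # list of (offset, length) bit ranges to fetch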

柒七 2024-12-18 08:26:41


I'm under the impression that you actually want public-key cryptography, where you provide the visitor with a public key used to encrypt the URL, and you decrypt the URL on your side using the secret key.

There are JavaScript implementations all over the web.
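
As a hedged illustration of the idea (shown in Python with the third-party cryptography package rather than JavaScript; the key size and padding are choices of this sketch): the plug-in encrypts with the public key, and only the server, holding the private key, can recover the URL.

    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding

    # Server side: generate the key pair once; ship the public key with the plug-in.
    private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    public_key = private_key.public_key()

    oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)

    # Client side: encrypt the cleaned-up URL before sending it.
    ciphertext = public_key.encrypt(b"www.apple.com/de/shop", oaep)

    # Server side: decrypt and match against the partner table.
    print(private_key.decrypt(ciphertext, oaep))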
