10-character hash for email addresses
How reliable is it to use a 10-character hash to identify email addresses?
MailChimp has 10-character alphanumeric IDs for email addresses.
If each character carries 4 bits (as with hex), 10 characters give 40 bits, or 2^40 values, a bit over one trillion. Maybe for a company the size of MailChimp this leaves reasonable headroom for a unique index space, and they have a single table with all possible emails, indexed by a 40-bit number.
I'd love to use the same style of hashes or coded IDs in links. To decide whether to go for indexes or hashes, I need to estimate the probability of two valid email addresses producing the same 10-character hash.
Any hints on evaluating that for a custom hash function, other than raw testing?
You don't explicitly say what you mean by "reliable", but I presume you're trying to avoid collisions. As wildplasser says, for random identifiers it's all about the birthday paradox: in an identifier space with 2^n possible IDs, the chance of a collision reaches roughly 50% once about 2^(n/2) IDs are in use.
The Wikipedia page on the birthday attack has a great table of collision probabilities under various parameters; for instance, with 64 bits and a desired maximum collision probability of 1 in 1 million, you can have about 6 million identifiers.
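If you'd rather compute such figures for your own parameters than read them off a table, here is a minimal sketch using the standard birthday approximation p ≈ 1 - exp(-k(k-1)/(2N)), where N = 2^bits. The function names are made up for illustration, and the estimate assumes your hash output behaves like uniformly random bits, which a custom hash function may not.

```python
import math


def collision_probability(num_ids: int, bits: int) -> float:
    """Approximate chance that at least two of num_ids random IDs collide.

    Assumes IDs are uniformly distributed over a space of 2**bits values.
    """
    space = 2.0 ** bits
    # 1 - exp(-k(k-1)/(2N)); expm1 keeps precision when the result is tiny.
    return -math.expm1(-num_ids * (num_ids - 1) / (2.0 * space))


def max_ids_for_probability(p: float, bits: int) -> int:
    """Approximate number of IDs at which the collision chance reaches p."""
    space = 2.0 ** bits
    # Invert the approximation above: k ~ sqrt(2N * ln(1 / (1 - p))).
    return int(math.sqrt(2.0 * space * -math.log1p(-p)))


if __name__ == "__main__":
    for bits in (40, 60, 64):
        print(
            bits,
            "bits:",
            max_ids_for_probability(1e-6, bits),
            "IDs at a 1-in-a-million collision chance,",
            max_ids_for_probability(0.5, bits),
            "IDs at even odds",
        )
```

Under this approximation, a 40-bit space allows only around 1,500 IDs at a 1-in-a-million collision chance and reaches even odds at a little over a million IDs, so the trillion-value headroom shrinks quickly once IDs are hashes rather than assigned indexes.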
Bear in mind that there are much more efficient ways to represent data in characters than hex; base64, for instance, gives you 3 bytes per 4 characters (6 bits per character), meaning 10 characters give you 60 bits instead of the 40 you get with hex.
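As a rough illustration of the 60-bit variant, the sketch below hashes an address with SHA-256 and keeps the first 10 characters of the URL-safe base64 encoding (standard base64 uses `+` and `/`, which are awkward in links). The function name `email_id`, the choice of SHA-256, and the trim-and-lowercase normalization are all my assumptions rather than anything MailChimp is known to do; note too that lowercasing the local part is a policy decision, since it is technically case-sensitive.

```python
import base64
import hashlib


def email_id(email: str, length: int = 10) -> str:
    """Return a short URL-safe ID derived from an email address.

    Each base64 character carries 6 bits, so 10 characters keep 60 bits
    of the digest.
    """
    # Assumed normalization: trim whitespace and lowercase the whole address.
    normalized = email.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # urlsafe_b64encode uses '-' and '_' instead of '+' and '/'.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    return encoded[:length]


if __name__ == "__main__":
    print(email_id("user@example.com"))
```

Truncating the encoded digest is fine for collision purposes as long as the underlying hash is well distributed; the collision odds are then governed by the 60 bits you keep, not by the full digest size.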