交换电子邮件地址(散列)的安全方式,以允许匹配另一个列表上的重叠,但不透露那些没有重叠的列表?
我所在的组织(A 公司)拥有大量电子邮件列表。我将此列表的 10,000 封电子邮件子集发送给另一个组织(公司 B)以测试重叠情况(发现哪些电子邮件地址同时出现在两个列表中)。我想以一种易于 B 公司测试重叠的方式发送列表,但对 B 公司来说很难(理想情况下不可能)“解码”不在其列表中的电子邮件地址。其次,我想确保如果我发送的列表最终落入坏人之手(某些第三方),其他任何人都很难了解列表中的实际电子邮件地址。
我当前的解决方案是简单地从我们的数据库中提取电子邮件,
SHA1(email + a_long_random_salt)
为每个电子邮件地址使用相同的盐。
为了进行匹配,我将哈希值列表和盐(安全地、单独地)发送给 B 公司,他们只需使用
SELECT email FROM members WHERE SHA1(email + the_salt) IN(hash1, hash2, hash3....)
(或者他们预先计算每个地址的 SHA1 哈希值并将其存储在数据库中) 即可搜索数据库电子邮件地址,因此在运行查询时不需要进行散列)
足够长/随机的盐可以防止使用预先计算的彩虹表来破解散列。我认为不太可能有人拥有一个包含数百万个可信电子邮件地址的彩虹表,其中包含我用作盐的 100 个字符的随机字符串。只要盐保密,任何第三方都不会使用彩虹表或暴力破解此列表。 (如果我在这里有什么错误,请纠正我。)
我正在努力解决的问题是,显然很容易获得从网络上收集的数百万个电子邮件地址的列表。公司 B 可以很容易地获取这些列表之一,使用我提供的盐计算哈希值,并恢复我发送的列表上的电子邮件的一些重要部分(当然不是全部,但很大一部分) 。
是否有一些我未能想出的策略来完成这场比赛?我唯一能想到的就是使用更复杂的哈希方法(即多次迭代)来使其更慢匹配数亿个电子邮件地址的列表(该理论列表是从网)。关键是它实际上只会更慢——甚至不会困难。另外,我知道 B 公司自己的电子邮件列表包含 100 万个地址,因此我无法为他们提供一个哈希方案,该方案需要花费很多秒来计算该 100 万个列表中的每个地址。简单地减慢速度并不能解决问题——我认为我需要一种完全不同的方法。
老实说,这个特殊案例对我来说更多的是一种学术练习,而不是真正的安全问题。我相信B公司不会这样做(我们经常合作),即使他们这样做了也不会造成太大损失。他们可能了解到的只是我们邮件列表中 10,000 人的电子邮件地址 - 我们不是在谈论密码、信用卡号等。如果我们要处理密码或信用卡号,我什至不会考虑开发我自己的一些计划。是的,我当然意识到 SHA-256 或其他一些较新的算法可能比 SHA1 更可取,但仅限于非常有限的程度。我在这里担心的不是对哈希值的暴力破解。
I'm with an organization (Company A) that has a large email list. I'm sending a 10,000 email subset of this list to another organization (Company B) to test for overlap (discover which email addresses are on both lists). I want to send the list in a way that is easy for Company B to test for overlap, but difficult (ideally impossible) for Company B to "decode" the email addresses which are NOT already on their list. Secondarily, I want to ensure that if the list I send winds up in the wrong hands (some 3rd party), it would be difficult for anyone else to learn the actual email addresses on the list.
My current solution is to simply pull the emails from our database as
SHA1(email + a_long_random_salt)
Using the same salt for each email address.
To do the match, I send the list of hashes and the salt (securely, separately) to Company B, and they simply search their database using
SELECT email FROM members WHERE SHA1(email + the_salt) IN(hash1, hash2, hash3....)
(Or they pre-compute the SHA1 hash for each address and store it in the DB with the email address so the hashing doesn't need to happen as the query is run)
A sufficiently long/random salt prevents against use of a precomputed rainbow table to crack the hashes. I assume it to be rather unlikely that anyone has a rainbow table of millions upon millions of plausible email addresses salted with whatever 100 character random string I use as my salt. As long as the salt is kept secret, no 3rd party is going to decode this list with a rainbow table or brute force. (Please, correct me if I'm somehow wrong here.)
The issue that I'm struggling with is there are obviously easily-obtained lists of millions upon millions of email addresses harvested from the web. It would be pretty easy for Company B to obtain one of these lists, compute the hashes using the salt I've provided, and recover some significant portion of emails on the list I've sent (certainly not all, but a significant portion).
Is there some strategy to accomplish this match that I'm failing to come up with? The only thing I can think of is to use a more complex hashing method (i.e. multiple iterations) to make it slower to match against a list of hundreds of millions of email addresses (that theoretical list scraped from the web). The key is that it would really only be slower -- not really even difficult. Also, I know that Company B's own email list is in the range of 1 million addresses, so I can't give them a hashing scheme that would take many seconds to compute for each address on that list of 1 million. Simply making it slower doesn't solve the issue -- I think I need a completely different approach.
Honestly, this particular case this is more of an academic exercise for me than a real security concern. I trust Company B is not going to try to do this (we work together often), and even if they did it would be no huge loss. All they could possibly learn is email addresses of 10,000 people on our mailing list -- we're not talking about passwords, credit card numbers, etc. If we were dealing with passwords or credit card numbers, I wouldn't even be considering developing some scheme of my own. And, yes, of course I realize that SHA-256 or some other newer algorithm might be a bit preferable to SHA1, but only to some very limited extent. It's not a brute force crack of the hash that I'm worried about here.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
您可以将交换作为安全多方计算问题进行 - 目标是计算唯一的电子邮件地址。
引用维基百科
如果您访问页面 http://en.wikipedia.org/wiki/Secure_multiparty_computation
“外部链接”部分包含可帮助您入门的库和参考资料。
You can conduct the exchange as a Secure Multi-Party Computation problem - with the goal of computing the unique email addresses.
Quoting Wikipedia
If you visit the page http://en.wikipedia.org/wiki/Secure_multiparty_computation
There "external links" section contains libraries and references to get you started.
我能想到的一件事是对已知域的暴力攻击。请考虑以下因素:
将名字 John 和姓氏 Doe 组合起来,我们可以构造一组地址,如 [电子邮件受保护]、[电子邮件受保护] 、[电子邮件受保护]等。该集不会非常广泛。
根据这种数据挖掘的重要性/有益程度(即 B 会从知道 John Doe 在您的列表中获得多少收益),我所描述的攻击仍然可以有利可图。是的,我记得有关盐的事情,但是名称/域组合的数量仍然不是太大,以至于对于良好的并行暴力攻击来说是牢不可破的。
One thing I can think of is a brute force attack on known domains. Consider the following factors:
Taking the combination of the name John and surname Doe, we can construct a set of addresses like [email protected], [email protected], [email protected] etc. The set won't be very extensive.
Depending on how important / benefitial such data mining is (i.e. how much will B gain from knowing that John Doe is in your list), attack that I described can still be profitable. Yes, I remember about salts, but still the number of name/domain combinations is not too large to be unbreakable for good parallel brute force attack.
在我看来,您的问题可以重述为:
重新声明一下,这似乎在逻辑上是不可能的 - 他们的客户数据库和他们可能拥有的电子邮件地址列表之间实际上没有区别在别处下载的。
It appears to me that your problem can be restated as:
Re-stated like that, it appears to be a logical impossibility - there is really no difference between their customer database and a list of email addresses they may have downloaded elsewhere.