交换电子邮件地址（散列）的安全方式，以允许匹配另一个列表上的重叠，但不透露那些没有重叠的列表？

发布于 2024-11-10 17:19:59 字数 1334 浏览 8 评论 0原文

我所在的组织（A 公司）拥有大量电子邮件列表。我将此列表的 10,000 封电子邮件子集发送给另一个组织（公司 B）以测试重叠情况（发现哪些电子邮件地址同时出现在两个列表中）。我想以一种易于 B 公司测试重叠的方式发送列表，但对 B 公司来说很难（理想情况下不可能）“解码”不在其列表中的电子邮件地址。其次，我想确保如果我发送的列表最终落入坏人之手（某些第三方），其他任何人都很难了解列表中的实际电子邮件地址。

我当前的解决方案是简单地从我们的数据库中提取电子邮件，

SHA1(email + a_long_random_salt)

为每个电子邮件地址使用相同的盐。

为了进行匹配，我将哈希值列表和盐（安全地、单独地）发送给 B 公司，他们只需使用

SELECT email FROM members WHERE SHA1(email + the_salt) IN(hash1, hash2, hash3....)

（或者他们预先计算每个地址的 SHA1 哈希值并将其存储在数据库中）即可搜索数据库电子邮件地址，因此在运行查询时不需要进行散列）

足够长/随机的盐可以防止使用预先计算的彩虹表来破解散列。我认为不太可能有人拥有一个包含数百万个可信电子邮件地址的彩虹表，其中包含我用作盐的 100 个字符的随机字符串。只要盐保密，任何第三方都不会使用彩虹表或暴力破解此列表。（如果我在这里有什么错误，请纠正我。）

我正在努力解决的问题是，显然很容易获得从网络上收集的数百万个电子邮件地址的列表。公司 B 可以很容易地获取这些列表之一，使用我提供的盐计算哈希值，并恢复我发送的列表上的电子邮件的一些重要部分（当然不是全部，但很大一部分）。

是否有一些我未能想出的策略来完成这场比赛？我唯一能想到的就是使用更复杂的哈希方法（即多次迭代）来使其更慢匹配数亿个电子邮件地址的列表（该理论列表是从网）。关键是它实际上只会更慢——甚至不会困难。另外，我知道 B 公司自己的电子邮件列表包含 100 万个地址，因此我无法为他们提供一个哈希方案，该方案需要花费很多秒来计算该 100 万个列表中的每个地址。简单地减慢速度并不能解决问题——我认为我需要一种完全不同的方法。

老实说，这个特殊案例对我来说更多的是一种学术练习，而不是真正的安全问题。我相信B公司不会这样做（我们经常合作），即使他们这样做了也不会造成太大损失。他们可能了解到的只是我们邮件列表中 10,000 人的电子邮件地址 - 我们不是在谈论密码、信用卡号等。如果我们要处理密码或信用卡号，我什至不会考虑开发我自己的一些计划。是的，我当然意识到 SHA-256 或其他一些较新的算法可能比 SHA1 更可取，但仅限于非常有限的程度。我在这里担心的不是对哈希值的暴力破解。

原文

I'm with an organization (Company A) that has a large email list. I'm sending a 10,000 email subset of this list to another organization (Company B) to test for overlap (discover which email addresses are on both lists). I want to send the list in a way that is easy for Company B to test for overlap, but difficult (ideally impossible) for Company B to "decode" the email addresses which are NOT already on their list. Secondarily, I want to ensure that if the list I send winds up in the wrong hands (some 3rd party), it would be difficult for anyone else to learn the actual email addresses on the list.

My current solution is to simply pull the emails from our database as

SHA1(email + a_long_random_salt)

Using the same salt for each email address.

To do the match, I send the list of hashes and the salt (securely, separately) to Company B, and they simply search their database using

SELECT email FROM members WHERE SHA1(email + the_salt) IN(hash1, hash2, hash3....)

(Or they pre-compute the SHA1 hash for each address and store it in the DB with the email address so the hashing doesn't need to happen as the query is run)

A sufficiently long/random salt prevents against use of a precomputed rainbow table to crack the hashes. I assume it to be rather unlikely that anyone has a rainbow table of millions upon millions of plausible email addresses salted with whatever 100 character random string I use as my salt. As long as the salt is kept secret, no 3rd party is going to decode this list with a rainbow table or brute force. (Please, correct me if I'm somehow wrong here.)

The issue that I'm struggling with is there are obviously easily-obtained lists of millions upon millions of email addresses harvested from the web. It would be pretty easy for Company B to obtain one of these lists, compute the hashes using the salt I've provided, and recover some significant portion of emails on the list I've sent (certainly not all, but a significant portion).

Is there some strategy to accomplish this match that I'm failing to come up with? The only thing I can think of is to use a more complex hashing method (i.e. multiple iterations) to make it slower to match against a list of hundreds of millions of email addresses (that theoretical list scraped from the web). The key is that it would really only be slower -- not really even difficult. Also, I know that Company B's own email list is in the range of 1 million addresses, so I can't give them a hashing scheme that would take many seconds to compute for each address on that list of 1 million. Simply making it slower doesn't solve the issue -- I think I need a completely different approach.

Honestly, this particular case this is more of an academic exercise for me than a real security concern. I trust Company B is not going to try to do this (we work together often), and even if they did it would be no huge loss. All they could possibly learn is email addresses of 10,000 people on our mailing list -- we're not talking about passwords, credit card numbers, etc. If we were dealing with passwords or credit card numbers, I wouldn't even be considering developing some scheme of my own. And, yes, of course I realize that SHA-256 or some other newer algorithm might be a bit preferable to SHA1, but only to some very limited extent. It's not a brute force crack of the hash that I'm worried about here.

分享到QQ

分享到微博