Unique identifier for an email
I am writing a C# application which allows users to store emails in an MS SQL Server database. Many times, multiple users will be copied on an email from a customer. If they all try to add the same email to the database, I want to make sure that the email is only added once.
MD5 springs to mind as a way to do this. I don't need to worry about malicious tampering, only to make sure that the same email will map to the same hash and that no two emails with different content will map to the same hash.
My question really boils down to how one would combine multiple fields into one MD5 (or other) hash value. Some of these fields will have a single value per email (e.g. subject, body, sender email address) while others will have multiple values (varying numbers of attachments, recipients). I want to develop a way of uniquely identifying an email that will be platform and language independent (not based on serialization). Any advice?
Comments (3)
What volume of emails do you plan on archiving? If you don't expect the archive to require many terabytes, I think this is a premature optimization.
Since each field can be represented as a string or an array of bytes, it doesn't matter how many values it contains; it all looks the same to a hash function. Just hash them all together and you will get an identifier that is unique for practical purposes.
EDIT: Pseudocode example
You will get the same output if you replace all the update calls with a single update call on the concatenation of the fields.
How you extract the strings, and the interface you use, will vary based on your application, language, and API.
It doesn't matter that different email clients might produce different formatting for some of the fields when given the same input; this will still give you a hash unique to the original email.
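To illustrate the "hash them all together" idea, here is a minimal sketch in Python using `hashlib`. The field values and the function names `hash_incremental` and `hash_concatenated` are made up for the example, and UTF-8 is an assumed encoding. It shows that feeding the fields one `update()` call at a time yields the same digest as a single `update()` on the concatenated string.

```python
import hashlib


def hash_incremental(fields):
    """Feed each field to the hash with its own update() call."""
    h = hashlib.md5()
    for value in fields:
        # Assumption: every field is text and is encoded as UTF-8.
        h.update(value.encode("utf-8"))
    return h.hexdigest()


def hash_concatenated(fields):
    """Concatenate all fields first, then hash them in one call."""
    h = hashlib.md5()
    h.update("".join(fields).encode("utf-8"))
    return h.hexdigest()


# Hypothetical field values: sender, recipient, subject, body.
fields = [
    "alice@example.com",
    "bob@example.com",
    "Quarterly report",
    "Please find the report attached.",
]

# Both approaches see the same byte stream, so the digests match.
digests_match = hash_incremental(fields) == hash_concatenated(fields)
```

For multi-valued fields such as recipients, you would probably want to put the values in a fixed order (for example, sorted) before hashing, so that two copies listing them in a different order still produce the same digest.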
Have you looked at some other headers like (in my mail, OS X Mail):
At least the Message-Id is required. That field could well be the same for every copy of the same mailing (sent to multiple recipients). That would be more effective than hashing.
Not the answer to the question, but maybe the answer to the problem :)
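As a rough sketch of the Message-Id approach, the snippet below pulls that header out of a raw message with Python's standard `email` package. The sample message and the helper name `message_id` are invented for illustration. The helper returns `None` when the header is missing, since in practice some messages may arrive without one and you would then need a fallback such as hashing.

```python
from email import message_from_string

# Hypothetical raw message, for illustration only.
RAW = (
    "Message-ID: <1234.abcd@example.com>\n"
    "From: customer@example.com\n"
    "To: alice@example.com, bob@example.com\n"
    "Subject: Order status\n"
    "\n"
    "Where is my order?\n"
)


def message_id(raw):
    """Return the Message-ID header value, or None if it is absent."""
    msg = message_from_string(raw)
    # Header lookup in the email package is case-insensitive.
    value = msg.get("Message-ID")
    return value.strip() if value else None


dedup_key = message_id(RAW)
```

The returned value could then be stored in a column with a unique constraint, so that a second attempt to insert the same message is rejected by the database.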
Why not just hash the raw message? It already encodes all the relevant fields except the envelope sender and recipient, and you can add those as headers yourself before hashing. It also contains all the attachments, the entire body of the message, etc., and it's a natural and easy representation. It also doesn't suffer from the easily generated hash collisions of mikerobi's proposal.
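A minimal Python sketch of both points, under some assumptions of mine: the header names `X-Envelope-From` and `X-Envelope-To` are hypothetical, sorting the recipients is my own choice to make the result order-independent, and `concat_hash` stands in for plain field concatenation. The first function shows why undelimited concatenation collides easily; the second hashes the raw message bytes with the envelope data prepended.

```python
import hashlib


def concat_hash(fields):
    """Plain concatenation: field boundaries are lost before hashing."""
    return hashlib.md5("".join(fields).encode("utf-8")).hexdigest()


def raw_message_hash(raw_message, envelope_sender, envelope_recipients):
    """Hash the raw message bytes, with envelope data prepended as headers."""
    h = hashlib.md5()
    # Hypothetical header names; the envelope is not part of the message itself.
    h.update(b"X-Envelope-From: " + envelope_sender.encode("utf-8") + b"\r\n")
    # Assumption: recipients are sorted so their order does not affect the hash.
    for rcpt in sorted(envelope_recipients):
        h.update(b"X-Envelope-To: " + rcpt.encode("utf-8") + b"\r\n")
    h.update(raw_message)
    return h.hexdigest()


# Different field splits, same concatenation, therefore the same hash.
collides = concat_hash(["ab", "c"]) == concat_hash(["a", "bc"])

# Hypothetical raw message bytes.
raw = b"From: customer@example.com\r\nSubject: Hi\r\n\r\nHello\r\n"
key = raw_message_hash(raw, "customer@example.com",
                       ["bob@example.com", "alice@example.com"])
```

If you do hash individual fields instead, writing each field's length (or a delimiter that cannot occur in the data) before its value is one common way to keep the boundaries unambiguous.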