Hashing SMTP and NNTP messages?
I want to store and index all of my historical e-mail and news as individual message files, using some computed hash code based on the message body+headers. Then I'll index on other things as well -- for searching.
For the primary index key, my thought is to use SHA-1 for the hash algorithm and assume that there will never be any collisions (although I know that there theoretically could be).
Besides the body, what headers should I index? Or more generally, what transformations should I apply to an in-memory copy of the message prior to hashing?
Should I ignore "ReSent-*:" headers? Should I join line-broken headers into single-line headers and remove extraneous whitespace?
(The reason I want to index the messages based on some hash instead of on the Message-ID header is that Message-ID headers aren't uniformly formatted.)
Comments (1)
You should hash precisely what constitutes the uniqueness of the message. If two messages may differ by the presence of "ReSent-*:" headers but must still be considered the "same" message, then those headers must not be part of what is hashed. Similarly, if equal messages may differ in header syntax, then you should normalize the header syntax before hashing. Hash functions such as SHA-1 return the same output only if the input is exactly the same, every single bit of it.
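As a rough illustration of that idea, here is a minimal Python sketch. The function name `canonical_digest` and the `skip_prefixes` parameter are made up for this example, and the specific normalization choices (lowercasing header names, unfolding continuation lines, collapsing whitespace, sorting headers, leaving the body untouched apart from line endings) are assumptions; which headers and transformations really define "the same message" is something you have to decide for your own archive.

```python
import hashlib
import re


def canonical_digest(raw: bytes, skip_prefixes=(b"resent-",)) -> str:
    """Return a SHA-1 hex digest of a normalized copy of a raw message.

    Hypothetical helper: the normalization rules are illustrative only.
    """
    # Treat CRLF and LF line endings as equivalent.
    raw = raw.replace(b"\r\n", b"\n")
    # The first blank line separates the header block from the body.
    head, _, body = raw.partition(b"\n\n")
    # Unfold headers: a continuation line starts with a space or tab.
    unfolded = re.sub(rb"\n[ \t]+", b" ", head)

    headers = []
    for line in unfolded.split(b"\n"):
        name, sep, value = line.partition(b":")
        if not sep:
            continue  # no colon, so not a header line; ignore it
        name = name.strip().lower()
        if name.startswith(skip_prefixes):
            continue  # headers that must not affect message identity
        # Collapse runs of whitespace inside the value.
        value = b" ".join(value.split())
        headers.append(name + b":" + value)

    # Sorting makes the digest independent of header order (an assumption).
    headers.sort()
    canonical = b"\n".join(headers) + b"\n\n" + body
    return hashlib.sha1(canonical).hexdigest()
```

With these rules, a folded copy and a resent copy of the same message get the same digest, while any change to the body gives a different one. Headers that change in transit, such as "Received:", could be added to `skip_prefixes` if they should not count either.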
Now, if Message-IDs would be enough for you apart from the formatting issue, there is a simple way: just hash the Message-IDs. A hashed Message-ID has a regular, fixed-size, randomized format on which you can index.
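A minimal sketch of that shortcut in Python, assuming the Message-ID value has already been extracted from the header. The helper name `message_id_key` is hypothetical, and stripping the surrounding whitespace and angle brackets before hashing is my own assumption about what "the same ID" should mean:

```python
import hashlib


def message_id_key(message_id: str) -> str:
    """Map a Message-ID of any shape to a fixed-size 40-character hex key."""
    # Assumption: surrounding whitespace and angle brackets are not significant.
    cleaned = message_id.strip().strip("<>")
    return hashlib.sha1(cleaned.encode("utf-8")).hexdigest()
```

The key has the same length whatever the original ID looked like, so it can be used directly as a file name or index column. This only works for messages that actually carry a Message-ID, and it treats two messages with the same ID as identical.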