Python digest/hash for string similarity

Posted 2024-12-26 22:00:47

I'm looking for an algorithm that can generate a short hash code/digest (for example 16 characters; the exact length is not important) from a longer string.

The main requirement is that strings which are almost identical should result in the same digest.

For example, two almost identical mails:

Hi Martin. Here are some ... spam for you. Regards XYZ.
=> AAAA AAAA AAAA AAAA

Hi Bo. Here are some ... spam for you. Regards EFG.
=> AAAA AAAA AAAA AAAA

return the same digest (or almost the same), whereas a different mail:

Hello Finn. This is a test mail.
=> CCCC CCCC CCCC CCCC

will return a different digest.

This algorithm would be part of a spam filter. The filter will remember digests from mails that it is certain are spam. If the same digest shows up in a mail it is unsure about, the identical digest will cause the filter to increase the spam score.

I know about Levenshtein distance, but it requires me to know the strings up front. In this situation I do not have this information. I could have it, but that would require the filter to store all spam e-mails and check each incoming mail against every one of them, which would be a very slow process.

Maybe some lossy compression algorithm coupled with a calculation of the Levenshtein distance between the two could work.

Any pointers appreciated.

Comments (1)

§对你不离不弃 2025-01-02 22:00:47

It looks like you want locality-sensitive hashing. Consider using minhash or shingling. There's a great explanation of both in Rajaraman & Ullman's book, Mining of Massive Datasets. You'll find numerous short Python implementations by searching blogs for the keywords above.
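
To make the suggestion concrete, here is a minimal sketch of word shingling plus a minhash signature, using only the standard library. The function names (shingles, minhash_signature, estimated_similarity), the shingle size k=3, the 64 hash functions, and the use of seeded BLAKE2b as the hash family are all assumptions chosen for illustration, not something taken from the book or from any particular library. The fraction of matching signature positions estimates the Jaccard similarity of the two shingle sets.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of text (lowercased, punctuation dropped)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(shingle, seed):
    """Deterministic 64-bit hash; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_hashes=64, k=3):
    """Signature = for each hash function, the minimum hash over all shingles."""
    shingle_set = shingles(text, k)
    return [min(_seeded_hash(s, seed) for s in shingle_set)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    spam_a = "Hi Martin. Here are some ... spam for you. Regards XYZ."
    spam_b = "Hi Bo. Here are some ... spam for you. Regards EFG."
    other = "Hello Finn. This is a test mail."

    sig_a = minhash_signature(spam_a)
    sig_b = minhash_signature(spam_b)
    sig_other = minhash_signature(other)

    print("spam_a vs spam_b:", estimated_similarity(sig_a, sig_b))
    print("spam_a vs other: ", estimated_similarity(sig_a, sig_other))
```

The filter would store only the fixed-length signatures of known spam rather than the mails themselves. A signature is not a single digest that is identical for near-duplicates; to get exact-match lookups, the usual next step is to split the signature into bands and hash each band, so that similar mails are likely to share at least one band key.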

There seem to be other approaches to this (which I don't know much about) that may be of interest to you, since they are specially tailored for spam messages, in particular the Nilsimsa hash.
