Is there a hashing algorithm that is tolerant of minor differences?
I'm doing some web crawling type stuff where I'm looking for certain terms in webpages and finding their location on the page, and then caching it for later use. I'd like to be able to check the page periodically for any major changes. Something like md5 can be foiled by simply putting the current date and time on the page.
Are there any hashing algorithms that work for something like this?
Comments (4)
A common way to do document similarity is shingling, which is somewhat more involved than hashing. Also look into content-defined chunking for a way to split up the document.
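As a rough illustration of shingling, here is a minimal sketch that splits each page into overlapping word windows and compares the two sets with Jaccard similarity. The function names, the window size `k` and the sample strings are all assumptions made for the example; real pages would need their markup stripped first.

```python
import re

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two versions of a page that differ only in the timestamp at the end.
old = "Breaking news: the quick brown fox jumps over the lazy dog. Updated 2009-01-01 10:00"
new = "Breaking news: the quick brown fox jumps over the lazy dog. Updated 2009-01-02 11:30"
similarity = jaccard(shingles(old), shingles(new))
print(similarity)  # high, but below 1.0, because only the trailing shingles differ
```

A page would then count as "changed" when the similarity drops below whatever threshold suits the crawl. Note that this compares full shingle sets, so it still needs the old shingles (or a compact summary of them) to be stored.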
I read a paper a few years back about using Bloom filters for similarity detection: "Using Bloom Filters to Refine Web Search Results". It's an interesting idea, but I never got around to experimenting with it.
This might be a good place to use the Levenshtein distance metric, which quantifies the amount of editing required to transform one sequence into another.
The drawback of this approach is that you'd need to keep the full text of each page so that you could compare them later. With a hash-based approach, on the other hand, you simply store some sort of small computed value and don't require the previous full text for comparison.
You might also try some sort of hybrid approach: let a hashing algorithm tell you that some change has been made, and use that as a trigger to retrieve an archived copy of the document for a more rigorous (Levenshtein) comparison.
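The hybrid idea could look something like the sketch below: an exact digest is compared first, and the edit distance is computed only when the digests differ. The names `levenshtein` and `has_major_change`, the 10% threshold, and the use of SHA-256 (md5 would play the same role) are assumptions for illustration, not a recommendation.

```python
import hashlib

def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def has_major_change(old_text, new_text, threshold=0.1):
    """Cheap exact hash first; run the expensive comparison only if it differs."""
    old_digest = hashlib.sha256(old_text.encode("utf-8")).hexdigest()
    new_digest = hashlib.sha256(new_text.encode("utf-8")).hexdigest()
    if old_digest == new_digest:
        return False
    distance = levenshtein(old_text, new_text)
    # Hypothetical rule: "major" means more than `threshold` of the text was edited.
    return distance / max(len(old_text), len(new_text)) > threshold

old_page = "Widget price list. Generated at 10:00"
new_page = "Widget price list. Generated at 10:05"
print(has_major_change(old_page, new_page))  # a one-character edit is not "major"
```

In practice the old digest would be stored with the cached page instead of being recomputed. The distance calculation is quadratic in the input length, so for long pages it may be more practical to pass lists of words or lines rather than raw strings; the function accepts any sequence.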
http://www.phash.org/ did something like this for images. The gist: take an image, blur it, convert it to greyscale, do a discrete cosine transform, and look at just the upper left quadrant of the result (the low frequencies, where the important information is). Then record a 0 for each value less than the average and a 1 for each value more than the average. The result holds up pretty well under small changes.
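A rough pure-Python sketch of the steps described above. It assumes the image has already been blurred, converted to greyscale and shrunk to a small square grid (32x32 here), and the names `perceptual_hash` and `hamming` are illustrative, not the pHash library's API. It also leaves the first (DC) coefficient out of the average, which is a choice made in this sketch rather than something stated above.

```python
import math

def dct_low_frequencies(pixels, keep=8):
    """Top-left keep x keep block of the 2D DCT-II of a square greyscale grid."""
    n = len(pixels)
    # Cosine basis for only the low frequencies we are going to keep.
    basis = [[math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
             for u in range(keep)]
    block = []
    for u in range(keep):
        row = []
        for v in range(keep):
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += pixels[y][x] * basis[u][y] * basis[v][x]
            row.append(total)
        block.append(row)
    return block

def perceptual_hash(pixels, keep=8):
    """One bit per low-frequency coefficient: 1 if above the average, else 0."""
    coeffs = [c for row in dct_low_frequencies(pixels, keep) for c in row]
    # Assumption: skip the DC term, which would otherwise dominate the average.
    mean = sum(coeffs[1:]) / (len(coeffs) - 1)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Synthetic 32x32 "image" and a uniformly brightened copy of it.
image = [[(x * y) % 128 for x in range(32)] for y in range(32)]
brighter = [[2 * p for p in row] for row in image]
print(hamming(perceptual_hash(image), perceptual_hash(brighter)))
```

Two hashes are then compared by Hamming distance, with a small distance meaning "probably the same picture".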
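Min-hashing is another possibility. Find features in your text (overlapping word shingles, for example), hash every feature with several different hash functions, and record the minimum value seen under each function. Put those minimums together to make the signature; two documents that share most of their features will agree in most positions.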
For both of the above, use a vantage point tree so that you can search for near-hits.
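For the text case, min-hashing might be sketched as follows. The shingle size, the number of hash functions, and the use of salted `hashlib.blake2b` as the family of hash functions are assumptions picked for the example; the vantage-point tree lookup is not shown.

```python
import hashlib
import re

def shingles(text, k=3):
    """Set of k-word shingles, each joined into a single string."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(items, num_hashes=128):
    """For each salted hash function, keep the smallest hash over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a different salt gives a different function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big")
            for item in items))
    return signature

def estimate_similarity(sig_a, sig_b):
    """Fraction of matching positions, which approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)

old = "Breaking news: the quick brown fox jumps over the lazy dog. Updated 2009-01-01 10:00"
new = "Breaking news: the quick brown fox jumps over the lazy dog. Updated 2009-01-02 11:30"
sig_old = minhash_signature(shingles(old))
sig_new = minhash_signature(shingles(new))
print(estimate_similarity(sig_old, sig_new))
```

Only the signature has to be stored per page, so the previous full text is not needed for later comparisons. The estimate gets tighter as `num_hashes` grows, at the cost of a longer signature.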
I am sorry to say, but hash algorithms are precise. There are none capable of tolerating minor differences. You should take another approach.