检测大型数据集中重复/相似的文本?

发布于 2024-09-30 05:34:35 字数 141 浏览 10 评论 0原文

我有一个包含数千条记录的大型数据库。每次用户发布他的信息时,我都需要知道是否已经有相同/相似的记录。有没有算法或开源实现来解决这个问题?

我们用的是中文,“相似”的意思是记录内容最相同,可能80%-100%是相同的。每条记录不会太大,大约2k-6k字节

I have a large database with thousands records. Every time a user post his information I need to know if there is already the same/similar record. Are there any algorithms or open source implementations to solve this problem?

We're using Chinese, and what 'similar' means is the records have most identical content, might be 80%-100% are the same. Each record will not be too big, about 2k-6k bytes

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

無處可尋 2024-10-07 05:34:35

这个答案是一个非常高的复杂度类(最坏的情况是五次,预期的情况是第一次验证数据库是四次,然后是四次/三次来添加记录,)所以它不能很好地扩展,不幸的是没有我现在能想到一个更好的答案。

该算法称为Ratcliff-Obershelp算法,它是用python的difflib。该算法本身是三次方时间最坏情况和二次方预期情况。然后,您必须对每个可能的记录对执行此操作,这是二次的。当然,当添加一条记录时,这只是线性的。

编辑:抱歉,我误读了文档,difflib 只是二次方,而不是三次方。使用它而不是其他算法。

This answer is of a very high complexity class (worst case it's quintic, expected case it's quartic to verify your database the first time, then quartic/cubic to add a record,) so it doesn't scale well, unfortunately there isn't a much better answer that I can think of right now.

The algorithm is called the Ratcliff-Obershelp algorithm, It's implemented in python's difflib. The algorithm itself is cubic time worst case and quadratic expected. Then you have to do that for each possible pair of records, which is quadratic. When adding a record, of course, this is only linear.

EDIT: Sorry, I misread the documentation, difflib is quadratic only, rather than cubic. Use it rather than the other algorithm.

月下凄凉 2024-10-07 05:34:35

看看 shngle-min-hash 技术。这是演示文稿可以帮助您。

Look at shngle-min-hash techniques. Here is a presentation that could help you.

携余温的黄昏 2024-10-07 05:34:35

我用来做类似事情的一种方法是根据单词统计信息构建通常的搜索索引,然后使用新项目,就好像它是针对该索引的搜索一样 - 如果搜索中顶部项目的分数太高高则表示新商品太相似。毫无疑问,一些标准文本搜索库可以用于此目的,尽管如果只有几千条记录,那么构建自己的搜索库就相当简单了。

One approach I have used to do something similar is to construct a search index in the usual based on word statistics and then use the new item as if it was a search against that index - if the score for the top item in the search is too high then the new item is too similar. No doubt some of the standard text search libraries could be used for this although if it is only a few thousands of records it is pretty trivial to build your own.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文