合并列表中的重复项? - 问题比看起来更复杂
因此,我在数据库 (MySql) 中有一个巨大的条目列表,
我在创建 Web 应用程序时使用 Python 和 Django。
这是我正在使用的基本 Django 模型:
class DJ(models.Model):
alias = models.CharField(max_length=255)
#other fields...
在我的数据库中现在重复
例如。 以上&超越、超越、超越、DJ 超越、 超越磁盘笑话,...
这是一个问题...因为它在我的数据库以及我的应用程序中炸出了一个大洞。
我相信其他人也遇到过这个问题并思考过。
我的想法如下:
创建一组规则,以便无法创建新条目?
<块引用> 例如。 “DJ 超越” 不能 创建是因为“Above & Beyond” 位于 数据库以某种方式将这些别名相互关联?
<块引用> 例如。将“DJ 超越” 与“超越”
我真的不知道如何继续下去,即使有人可以给我指出一个方向会很有帮助的。
任何帮助将非常感激!谢谢你们。
So I have a huge list of entries in a DB (MySql)
I'm using Python and Django in the creation of my web application.
This is the base Django model I'm using:
class DJ(models.Model):
alias = models.CharField(max_length=255)
#other fields...
In my DB I have now duplicates
eg. Above & Beyond, Above and Beyond, Above Beyond, DJ Above and Beyond,
Disk Jokey Above and Beyond, ...
This is a problem... as it blows a big hole in my DB and therefore my application.
I'm sure other people have encountered this problem and thought about it.
My ideas are the following:
Create a set of rules so a new entry cannot be created?
eg. "DJ Above and Beyond" cannot be
created because "Above & Beyond" is in
the DBRelate these aliases to each other somehow?
eg. relate "DJ Above and Beyond" to "Above & Beyond"
I have literally no clue how to go on about this, even if someone could point me into a direction that would be very helpful.
Any help would be very much appreciated! Thank you guys.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
我想你可以根据 Levenshtein 距离 做一些事情,但没有真正的方法可以自动执行此操作- 无需创建相当复杂的基于规则的系统。
除非您可以定义一个规则系统来计算任何
x
和y
是否x
是y
的重复项>,你将不得不以一种模糊的、人性化的方式来处理这个问题。Stack Overflow 有一个相当不错的方法来处理这个问题 - 根据 Levenshtein 距离(也许还有某种规则引擎)之类的东西,警告用户是否有重复的东西可能,然后允许如果其他用户忽略警告,您的用户会将内容合并为重复项。
I guess you could do something based on Levenshtein distance, but there's no real way to do this automatically - without creating a fairly complex rules-based system.
Unless you can define a rules system that can work out for any
x
andy
whetherx
is a duplicate ofy
, you're going to have to deal with this in a fuzzy, human way.Stack Overflow has a fairly decent way of dealing with this - warn users if something may be a duplicate, based on something like Levenshtein distance (and perhaps some kind of rules engine), and then allow a subset of your users to merge things as duplicates if other users ignore the warnings.
从您给出的示例来看,听起来您遇到的问题更多的是自然语言问题,而不是精确匹配问题。鉴于自然语言匹配本质上是不精确的,您不太可能想出完美的解决方案。
补充一点,我的经验仅适用于英语,例如,英语 PorterStemmer 无法识别您输入的法语标题。
From the examples you give, it sounds like you have more a natural language problem than an exact matching problem. Given that natural language matching is inexact by nature you're unlikely to come up with a perfect solution.
Add the caveat that my experience only applies to English, e.g. an english PorterStemmer won't recognize the one French title you put in there.
我认为这更多的是一个社会问题而不是编程问题。像这样的自然语言处理的任何编程解决方案都会有缺陷且容易出错。很难区分那些接近但与您正在谈论的不需要的重复项完全不同的事物。
正如 Dominic 提到的,Stack Overflow 的标记系统是一个非常好的模型。它向用户提供提示,鼓励他们在适当的情况下使用现有标签(用户键入时出现下拉列表),它允许受信任的用户重新标记单个问题,并允许版主进行批量重新标记。
这确实是一个必须有人直接参与的过程。
I think this is more of a social problem than a programming problem. Any sort of programatic solution to natural language processing like this is going to be buggy and error prone. It's very hard to distinguish things that are close, but legitimately different from the sort of undesired duplicates that you're talking about.
As Dominic mentioned, Stack Overflow's tagging system is a pretty good model for this. It provides cues to the user that encourage them to use existing tags if appropriate (drop down lists as the user types), it allows trusted users to retag individual questions, and it allows moderators to do mass retags.
This is really a process that has to have a person directly involved.
这不是一个完整的解决方案,但我有一个想法:
这将允许您为同一个 DJ 拥有多个别名。
但您仍然需要找到一种正确的方法来确保将别名添加到正确的 dj。请参阅多米尼加 帖子。
但是,如果您将一个别名与指向一位 DJ 的其他几个别名进行比较,算法可能会工作得更好。
This is not a complete solutions but one thought I had:
This would allow you to have several Aliases for the same dj.
But still you will need to find a proper way to ensure the aliases are added to the right dj. See Dominics post.
But if you check an alias against several other aliases pointing to one dj, the algorithms might work better.
您可以尝试仅针对此实例解决此问题(将“&”替换为“&”,将“DJ”替换为“Disk笑话”或忽略“DJ”等)。如果你的桌子只包含 DJ,你可以设置一堆类似的规则。
如果您的表格包含更多不同的内容,您将不得不采用更具结构性的方法。你能给出你的数据集的样本吗?
You could try to solve this problem for this instance only (replacing the "&" with "&" and "DJ" with "Disk jokey" or ignore "DJ" etc..). If your table only contains DJ's you could set up a bunch of rules like those.
If your table contains more diverse stuff you will have to go with a more structural approach. Could you give a sample of your dataset?
首先,编程任务(如上所述的 NLP 等)当然很有趣。但正如前面提到的,想要完美这一点就有点矫枉过正了。
但另一种观点如前所述(“社交”),谁输入数据,谁查看数据,数据应该多长以及正确程度如何?所以这是一个命名约定问题,让我想起了伟大的项目 musicbrainz.org - 如果您的网站“正常工作”还是您更喜欢遵循标准,在后一种情况下,我会沿着 mb 项目定位自己 - 如果您还没有'我没有这样做,也没有听说过。
IE。请参阅此处的以上和超越:他们定义了别名,他们用它来匹配用户搜索。
http://musicbrainz.org/show/artist/aliases.html?artistid= 58438
另请查看 wiki 中的 Artist_Alias 页面。
数据模型值得一看,甚至还有几个用于同步数据的 API 绑定,也是用 Python 编写的。
First of all of course the programming task (NLP etc. as mentioned) is interesting. But as mentioned it's overkill aiming to perfect that.
But the other view is as mentioned ("social"), who enters the data, who views it, how long and how correct should it be? So it's a naming convention issue and reminds me to the great project musicbrainz.org - should your site "just work" or do you prefer to go along standards, in latter case i would orient myself along the mb project - in case you haven't done that and not heard of it.
ie. see here for Above & Beyond: they have on alias defined, they use it to match user searches.
http://musicbrainz.org/show/artist/aliases.html?artistid=58438
check out also the Artist_Alias page in the wiki.
The data model is worth a look and there are even several API bindings to sync data, also in python.
如何更改模型,使“别名”成为其他表的键列表,如下所示(跳过“the”、“and”等小词):
1 =>多于;
2=>超过;
3=>磁盘;
4=>笑话;
然后,当您想要插入新记录时,只需检查标题中有多少重要单词已在表中并与当前现有的模型实体匹配。如果超过 50%(例如),也许您有一个巧合,您可以向访问者显示它们的列表并询问“您是指其中的一些吗”。
How about changing the model so "alias" to be list of keys to other table that looks like this (skipping small words like "the", "and", etc.):
1 => Above;
2 => Beyond;
3 => Disk;
4 => Jokey;
Then when you want to insert new record just check how many of the significant words from the title are already in the table and match currently existing model entities. If more than 50% (for example) maybe you have a coincidence and you can show list of them to the visitor and asking "do you mean some of this one".
看起来 fuzzywuzzy 非常适合您的需求。
这篇文章解释了它设置的原因,其中非常符合您的要求 - 基本上,处理两个不同事物的命名略有不同的情况:
Looks like fuzzywuzzy is a perfect match to your needs.
This article explains the reason it was set up, which very closely matches your requirements -- basically, to handle situations in which two different things were named slightly differently:
如果您只需要艺术家姓名或一般与媒体相关的姓名,那么最好使用 last.fm 或 echonest 的 API,因为它们已经拥有庞大的规则集和庞大的数据库可供选择。
If you're only after artist names or generally media related names it might be much better to just use the API of last.fm or echonest as they already have a huge rule set and a huge database to settle on.