合并列表中的重复项? - 问题比看起来更复杂

发布于 2024-08-18 20:39:46 字数 890 浏览 6 评论 0原文

因此,我在数据库 (MySql) 中有一个巨大的条目列表,

我在创建 Web 应用程序时使用 PythonDjango

这是我正在使用的基本 Django 模型:

class DJ(models.Model):
    alias = models.CharField(max_length=255)
    #other fields...

在我的数据库中现在重复

例如。 以上&超越超越超越DJ 超越超越磁盘笑话,...

这是一个问题...因为它在我的数据库以及我的应用程序中炸出了一个大洞。


我相信其他人也遇到过这个问题并思考过。

我的想法如下:

  • 创建一组规则,以便无法创建新条目?

    <块引用> 例如。 “DJ 超越” 不能 创建是因为“Above & Beyond” 位于 数据库

  • 以某种方式将这些别名相互关联?

    <块引用> 例如。将“DJ 超越”“超越”


我真的不知道如何继续下去,即使有人可以给我指出一个方向会很有帮助的。

任何帮助将非常感激!谢谢你们。

So I have a huge list of entries in a DB (MySql)

I'm using Python and Django in the creation of my web application.

This is the base Django model I'm using:

class DJ(models.Model):
    alias = models.CharField(max_length=255)
    #other fields...

In my DB I have now duplicates

eg. Above & Beyond, Above and Beyond, Above Beyond, DJ Above and Beyond,
Disk Jokey Above and Beyond, ...

This is a problem... as it blows a big hole in my DB and therefore my application.


I'm sure other people have encountered this problem and thought about it.

My ideas are the following:

  • Create a set of rules so a new entry cannot be created?

    eg. "DJ Above and Beyond" cannot be
    created because "Above & Beyond" is in
    the DB

  • Relate these aliases to each other somehow?

    eg. relate "DJ Above and Beyond" to "Above & Beyond"


I have literally no clue how to go on about this, even if someone could point me into a direction that would be very helpful.

Any help would be very much appreciated! Thank you guys.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

清欢 2024-08-25 20:39:46

我想你可以根据 Levenshtein 距离 做一些事情,但没有真正的方法可以自动执行此操作- 无需创建相当复杂的基于规则的系统。

除非您可以定义一个规则系统来计算任何 xy 是否 xy 的重复项>,你将不得不以一种模糊的、人性化的方式来处理这个问题。

Stack Overflow 有一个相当不错的方法来处理这个问题 - 根据 Levenshtein 距离(也许还有某种规则引擎)之类的东西,警告用户是否有重复的东西可能,然后允许如果其他用户忽略警告,您的用户会将内容合并为重复项。

I guess you could do something based on Levenshtein distance, but there's no real way to do this automatically - without creating a fairly complex rules-based system.

Unless you can define a rules system that can work out for any x and y whether x is a duplicate of y, you're going to have to deal with this in a fuzzy, human way.

Stack Overflow has a fairly decent way of dealing with this - warn users if something may be a duplicate, based on something like Levenshtein distance (and perhaps some kind of rules engine), and then allow a subset of your users to merge things as duplicates if other users ignore the warnings.

滥情稳全场 2024-08-25 20:39:46

从您给出的示例来看,听起来您遇到的问题更多的是自然语言问题,而不是精确匹配问题。鉴于自然语言匹配本质上是不精确的,您不太可能想出完美的解决方案。

  • 字符串距离实际上不起作用,因为算法上接近的字符串在语义上可能并不接近(例如“DJ Below & Beyond”应该匹配“Above and Beyond”,但不匹配“DJ Below & Beyond 2”,后者在 Levenshtein 中更接近) 自然语言解析的一些廉价替代方案
  • soundex,它将通过语音进行匹配,以及词干,它删除前缀/后缀以标准化词干,我想你可以创建一个链接。词根列表,但这也不是非常准确,
  • “这是您想要输入的内容之一吗?”
  • 如果这是一个用户交互程序,您可以向用户回显“未遂事件”,例如 以某种方式规范化条目,以便不同的条目映射到相同的规范化值(例如大小写规范化、“&”->“And”等,上述一些建议可能是朝着这个方向迈出的一步)以查找未遂事件或将多个输入映射到单个值。

补充一点,我的经验仅适用于英语,例如,英语 PorterStemmer 无法识别您输入的法语标题。

From the examples you give, it sounds like you have more a natural language problem than an exact matching problem. Given that natural language matching is inexact by nature you're unlikely to come up with a perfect solution.

  • String distance doesn't really work, as strings that are algorithmically close may not be semantically close (e.g. "DJ Above & Beyond" should match "Above and Beyond" but not "DJ Above & Beyond 2" which is closer in Levenshtein distance.
  • Some cheap alternatives to natural language parsing are soundex, which will match by phonetic sounds, and Stemming, which removes prefixes/suffixes to normalize on word stems. I suppose you could create a linked list of word roots, but this wouldn't be terribly accurate either.
  • If this is a User-interacting program, you could echo "near misses" to the user, e.g. "Is one of these what you meant to enter?"
  • You could normalize the entries in some way so that different entries map to the same normalized value (e.g. case normalize, "&" -> "And", etc, etc. which some of the above suggestions might be a step towards) to find near misses or map multiple inputs to a single value.

Add the caveat that my experience only applies to English, e.g. an english PorterStemmer won't recognize the one French title you put in there.

疯到世界奔溃 2024-08-25 20:39:46

我认为这更多的是一个社会问题而不是编程问题。像这样的自然语言处理的任何编程解决方案都会有缺陷且容易出错。很难区分那些接近但与您正在谈论的不需要的重复项完全不同的事物。

正如 Dominic 提到的,Stack Overflow 的标记系统是一个非常好的模型。它向用户提供提示,鼓励他们在适当的情况下使用现有标签(用户键入时出现下拉列表),它允许受信任的用户重新标记单个问题,并允许版主进行批量重新标记。

这确实是一个必须有人直接参与的过程。

I think this is more of a social problem than a programming problem. Any sort of programatic solution to natural language processing like this is going to be buggy and error prone. It's very hard to distinguish things that are close, but legitimately different from the sort of undesired duplicates that you're talking about.

As Dominic mentioned, Stack Overflow's tagging system is a pretty good model for this. It provides cues to the user that encourage them to use existing tags if appropriate (drop down lists as the user types), it allows trusted users to retag individual questions, and it allows moderators to do mass retags.

This is really a process that has to have a person directly involved.

小瓶盖 2024-08-25 20:39:46

这不是一个完整的解决方案,但我有一个想法:

class DJ(models.Model):
    #other fields, no alias!

class DJAlias(models.Model):
    dj = models.ForeignKey(DJ)

这将允许您为同一个 DJ 拥有多个别名。

但您仍然需要找到一种正确的方法来确保将别名添加到正确的 dj。请参阅多米尼加 帖子。

但是,如果您将一个别名与指向一位 DJ 的其他几个别名进行比较,算法可能会工作得更好。

This is not a complete solutions but one thought I had:

class DJ(models.Model):
    #other fields, no alias!

class DJAlias(models.Model):
    dj = models.ForeignKey(DJ)

This would allow you to have several Aliases for the same dj.

But still you will need to find a proper way to ensure the aliases are added to the right dj. See Dominics post.

But if you check an alias against several other aliases pointing to one dj, the algorithms might work better.

懷念過去 2024-08-25 20:39:46

您可以尝试仅针对此实例解决此问题(将“&”替换为“&”,将“DJ”替换为“Disk笑话”或忽略“DJ”等)。如果你的桌子只包含 DJ,你可以设置一堆类似的规则。
如果您的表格包含更多不同的内容,您将不得不采用更具结构性的方法。你能给出你的数据集的样本吗?

You could try to solve this problem for this instance only (replacing the "&" with "&" and "DJ" with "Disk jokey" or ignore "DJ" etc..). If your table only contains DJ's you could set up a bunch of rules like those.
If your table contains more diverse stuff you will have to go with a more structural approach. Could you give a sample of your dataset?

爱的那么颓废 2024-08-25 20:39:46

首先,编程任务(如上所述的 NLP 等)当然很有趣。但正如前面提到的,想要完美这一点就有点矫枉过正了。

但另一种观点如前所述(“社交”),谁输入数据,谁查看数据,数据应该多长以及正确程度如何?所以这是一个命名约定问题,让我想起了伟大的项目 musicbrainz.org - 如果您的网站“正常工作”还是您更喜欢遵循标准,在后一种情况下,我会沿着 mb 项目定位自己 - 如果您还没有'我没有这样做,也没有听说过。
IE。请参阅此处的以上和超越:他们定义了别名,他们用它来匹配用户搜索。
http://musicbrainz.org/show/artist/aliases.html?artistid= 58438
另请查看 wiki 中的 Artist_Alias 页面。

数据模型值得一看,甚至还有几个用于同步数据的 API 绑定,也是用 Python 编写的。

First of all of course the programming task (NLP etc. as mentioned) is interesting. But as mentioned it's overkill aiming to perfect that.

But the other view is as mentioned ("social"), who enters the data, who views it, how long and how correct should it be? So it's a naming convention issue and reminds me to the great project musicbrainz.org - should your site "just work" or do you prefer to go along standards, in latter case i would orient myself along the mb project - in case you haven't done that and not heard of it.
ie. see here for Above & Beyond: they have on alias defined, they use it to match user searches.
http://musicbrainz.org/show/artist/aliases.html?artistid=58438
check out also the Artist_Alias page in the wiki.

The data model is worth a look and there are even several API bindings to sync data, also in python.

甜宝宝 2024-08-25 20:39:46

如何更改模型,使“别名”成为其他表的键列表,如下所示(跳过“the”、“and”等小词):
1 =>多于;
2=>超过;
3=>磁盘;
4=>笑话;

然后,当您想要插入新记录时,只需检查标题中有多少重要单词已在表中并与当前现有的模型实体匹配。如果超过 50%(例如),也许您有一个巧合,您可以向访问者显示它们的列表并询问“您是指其中的一些吗”。

How about changing the model so "alias" to be list of keys to other table that looks like this (skipping small words like "the", "and", etc.):
1 => Above;
2 => Beyond;
3 => Disk;
4 => Jokey;

Then when you want to insert new record just check how many of the significant words from the title are already in the table and match currently existing model entities. If more than 50% (for example) maybe you have a coincidence and you can show list of them to the visitor and asking "do you mean some of this one".

蓝天白云 2024-08-25 20:39:46

看起来 fuzzywuzzy 非常适合您的需求。

这篇文章解释了它设置的原因,其中非常符合您的要求 - 基本上,处理两个不同事物的命名略有不同的情况:

我们最始终令人沮丧的问题之一是试图弄清楚两个门票列表是否针对同一现实生活中的活动(也就是说,没有寻求我们实习生大军的帮助)。
...
为了实现这一目标,我们建立了一个“模糊”字符串匹配例程库来帮助我们。

Looks like fuzzywuzzy is a perfect match to your needs.

This article explains the reason it was set up, which very closely matches your requirements -- basically, to handle situations in which two different things were named slightly differently:

One of our most consistently frustrating issues is trying to figure out whether two ticket listings are for the same real-life event (that is, without enlisting the help of our army of interns).

To achieve this, we've built up a library of "fuzzy" string matching routines to help us along.

海拔太高太耀眼 2024-08-25 20:39:46

如果您只需要艺术家姓名或一般与媒体相关的姓名,那么最好使用 last.fm 或 echonest 的 API,因为它们已经拥有庞大的规则集和庞大的数据库可供选择。

If you're only after artist names or generally media related names it might be much better to just use the API of last.fm or echonest as they already have a huge rule set and a huge database to settle on.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文