How do I determine whether two similar band names represent the same band?

Posted on 2024-08-14 14:26:13

I'm currently working on a project that requires me to match our database of bands and venues against a number of external services.

Basically, I'm looking for some direction on the best method for determining whether two names refer to the same thing. For example:

  • Our database venue name - "The Pig and Whistle"
  • service 1 - "Pig and Whistle"
  • service 2 - "The Pig & Whistle"
  • etc.

I think the main differences are going to be things like a missing "the" or "&" used instead of "and", but there could also be slightly different spellings and words in a different order.

What algorithms or techniques are commonly used in this situation? Do I need to filter noise words, or do some sort of spell-check-style matching?

Have you seen any examples of something similar in C#?

UPDATE: In case anyone is interested in a C# example, there are plenty you can find by doing a Google code search for Levenshtein distance.

Comments (4)

冷心人i 2024-08-21 14:26:13

The canonical (and probably the easiest) way to do this is to measure the Levenshtein distance between the two strings. If the distance is small relative to the size of the string, it's probably the same string. Note that if you have to compare a lot of very small strings it'll be harder to tell whether they're the same or not. It works better with longer strings.
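
As a rough illustration of "distance small relative to the size of the string", here is a minimal Python sketch. The helper names and the 0.25 threshold are assumptions made for this example, not values from the answer; a real cut-off would need tuning against your own data.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def relative_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length; 0.0 means identical."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def probably_same(a: str, b: str, threshold: float = 0.25) -> bool:
    # The threshold is an arbitrary assumption for illustration only.
    return relative_distance(a, b) <= threshold
```

Dividing by the longer length is one simple way to make scores comparable across names of different sizes; it also shows why short names are hard, since a single edit in a four-letter name already costs 0.25.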

A smarter approach might be to compute the Levenshtein distance between the two strings but to assign a distance of zero to the more obvious transformations, like "and"/"&", "Snoop Doggy Dogg"/"Snoop", etc.
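
One simple way to give such transformations a cost of zero is to rewrite both names into a common form before measuring any distance. The sketch below is only an assumption about how that might look: the rule tables `EQUIVALENTS` and `NOISE_PREFIXES` are hypothetical and would need extending for real data, and alias pairs such as "Snoop Doggy Dogg"/"Snoop" need a lookup table rather than a rewrite rule.

```python
import re

# Hypothetical rewrite rules, chosen for this example only.
EQUIVALENTS = {"&": "and", "+": "and"}
NOISE_PREFIXES = ("the ",)


def normalize(name: str) -> str:
    """Lowercase, unify 'and' symbols, and drop a leading noise word."""
    text = name.lower().strip()
    # Pad the symbols so that "Pig&Whistle" splits into separate tokens.
    text = re.sub(r"([&+])", r" \1 ", text)
    tokens = [EQUIVALENTS.get(token, token) for token in text.split()]
    text = " ".join(tokens)
    for prefix in NOISE_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):]
    return text


names = ["The Pig and Whistle", "Pig and Whistle", "The Pig & Whistle"]
print({normalize(n) for n in names})
```

After normalization the three venue names from the question collapse to a single string, so any edit distance computed afterwards only has to account for genuine spelling differences.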

魔法唧唧 2024-08-21 14:26:13

I did something like this a while ago. I used the Discogs database (which is public domain), which also tracks artist aliases.

You can either:

  • Use an API call (namevariations field).
  • Download the monthly data dumps (*_artists.xml.gz) and import them into your database. This contains the same data, but is obviously a lot faster.

One advantage of this over the Levenshtein distance solution is that you'll get far fewer false matches.
For example, Ryan Adams and Bryan Adams have a score of 2, which is quite good (lower means a better match; Pig and Whistle and Pig & Whistle have a score of 3), yet they're obviously different people.

While you could make a smarter algorithm (which also looks at string length, for example), using the alias DB is a lot simpler and less error-prone; after implementing this, I could completely remove the solution that was suggested in the other answer, and I had better matches.
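
To make the alias idea concrete, here is a minimal Python sketch of an alias lookup. Everything in it is hypothetical: the `ALIASES` table, the numeric ids and the function names are invented for illustration, and in practice the table would be populated from whatever alias source you import.

```python
from typing import Dict, List

# Hypothetical alias table: artist id -> known name variations.
ALIASES: Dict[int, List[str]] = {
    1: ["Snoop Dogg", "Snoop Doggy Dogg", "Snoop"],
    2: ["Ryan Adams"],
    3: ["Bryan Adams"],
}


def build_index(aliases: Dict[int, List[str]]) -> Dict[str, int]:
    """Map every case-folded name variation to its artist id."""
    index: Dict[str, int] = {}
    for artist_id, names in aliases.items():
        for name in names:
            index[name.casefold()] = artist_id
    return index


INDEX = build_index(ALIASES)


def same_artist(a: str, b: str) -> bool:
    """True only when both names are known and resolve to the same id."""
    id_a = INDEX.get(a.casefold())
    id_b = INDEX.get(b.casefold())
    return id_a is not None and id_a == id_b


print(same_artist("Snoop", "Snoop Doggy Dogg"))  # True
print(same_artist("Ryan Adams", "Bryan Adams"))  # False
```

Because the match is an exact lookup rather than a similarity score, near-identical names of different people stay apart, at the cost of missing any variation the alias source does not list.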

自由如风 2024-08-21 14:26:13

Soundex may also be useful.
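
For reference, a minimal Python sketch of the common American Soundex rules follows. It encodes a single word, so a multi-word band name would have to be split and encoded word by word; that per-word handling is an assumption of this example, not part of the answer.

```python
# Letter groups of American Soundex; vowels, y, h and w carry no digit.
CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Four-character Soundex code of one word, or '' if it has no letters."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    last_code = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != last_code:
            result += code
        last_code = code  # a vowel resets this, so a repeated code counts again
        if len(result) == 4:
            break
    return result.ljust(4, "0")


print(soundex("Whistle"), soundex("Wistle"))
print(soundex("Ryan"), soundex("Bryan"))
```

Soundex groups words that sound alike in English, so it tolerates misspellings such as a dropped silent letter, but it always keeps the first letter and so treats "Ryan" and "Bryan" as different.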

童话 2024-08-21 14:26:13

In bioinformatics we do this all the time to compare DNA or protein sequences.

There are plenty of algorithms; you probably want to look at global alignments.

In this respect the Needleman-Wunsch algorithm is probably what you seek.
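
A minimal Python sketch of the Needleman-Wunsch scoring recurrence is shown below. It only computes the optimal global alignment score, not the alignment itself, and the match, mismatch and gap values are assumed defaults for illustration; bioinformatics tools normally use substitution matrices instead.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b under a simple scoring scheme."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty string costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


print(needleman_wunsch_score("The Pig and Whistle", "The Pig & Whistle"))
```

With `match=0`, `mismatch=-1` and `gap=-1` the score is simply the negated Levenshtein distance, which shows how closely the two techniques are related; the extra freedom lies in choosing different costs for gaps and for specific character pairs.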

If you have particularly long recurring strings to compare you might also want to consider heuristic searches like BLAST.
