Determine whether a business name is very similar to another one - Python

I'm working with a large database of businesses.
I'd like to be able to compare two business names for similarity to see if they might possibly be duplicates.
Below is a list of business names that should test as having a high probability of being duplicates. What is a good way to go about this?
George Washington Middle Schl
George Washington School

Santa Fe East Inc
Santa Fe East

Chop't Creative Salad Co
Chop't Creative Salad Company

Manny and Olga's Pizza
Manny's & Olga's Pizza

Ray's Hell Burger Too
Ray's Hell Burgers

El Sol
El Sol de America

Olney Theatre Center for the Arts
Olney Theatre

21 M Lounge
21M Lounge

Holiday Inn Hotel Washington
Holiday Inn Washington-Georgetown

Residence Inn Washington,DC/Dupont Circle
Residence Inn Marriott Dupont Circle

Jimmy John's Gourmet Sandwiches
Jimmy John's

Omni Shoreham Hotel at Washington D.C.
Omni Shoreham Hotel
I've recently done a similar task, although I was matching new data to existing names in a database, rather than looking for duplicates within one set. Name matching is actually a well-studied task, with a number of factors beyond what you'd consider for matching generic strings.
First, I'd recommend taking a look at a paper, How to play the “Names Game”: Patent retrieval comparing different heuristics by Raffo and Lhuillery. The published version is here, and a PDF is freely available here. The authors provide a nice summary, comparing a number of different matching strategies. They consider three stages, which they call parsing, matching, and filtering.
Parsing consists of applying various cleaning techniques to normalize the names before they are compared.
In my case, I folded all letters to lowercase, replaced all punctuation with whitespace, replaced accented characters by unaccented counterparts, removed all other special characters, and removed legal control terms from the beginning and ends of the names following a list.
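A minimal sketch of that kind of cleanup (the legal-terms list here is a small illustrative stand-in, not the full one I used):

```python
import string
import unicodedata

# Illustrative, deliberately incomplete list of legal control terms
LEGAL_TERMS = {"inc", "llc", "ltd", "co", "corp", "company"}

def parse_name(name):
    # Fold to lowercase and strip accents (e.g. "e" with an acute -> "e")
    name = unicodedata.normalize("NFKD", name.lower())
    name = "".join(c for c in name if not unicodedata.combining(c))
    # Replace punctuation with whitespace
    name = name.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Drop any remaining special characters
    name = "".join(c for c in name if c.isalnum() or c.isspace())
    # Remove legal control terms from the beginning and end
    tokens = name.split()
    while tokens and tokens[0] in LEGAL_TERMS:
        tokens.pop(0)
    while tokens and tokens[-1] in LEGAL_TERMS:
        tokens.pop()
    return " ".join(tokens)

print(parse_name("Chop't Creative Salad Co"))  # "chop t creative salad"
```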
Matching is the comparison of the parsed names. This could be simple string matching, edit distance, Soundex or Metaphone, comparison of the sets of words making up the names, or comparison of sets of letters or n-grams (letter sequences of length n). The n-gram approach is actually quite nice for names, as it ignores word order, helping a lot with things like "department of examples" vs. "examples department". In fact, comparing bigrams (2-grams, character pairs) using something simple like the Jaccard index is very effective. In contrast to several other suggestions, Levenshtein distance is one of the poorer approaches when it comes to name matching.
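As a concrete illustration, a bigram Jaccard comparison can be as small as this (a sketch that keeps whitespace inside the bigrams):

```python
def bigrams(name):
    # Set of character pairs, e.g. "sol" -> {"so", "ol"}
    return {name[i:i + 2] for i in range(len(name) - 1)}

def jaccard(a, b):
    # |intersection| / |union| of the two bigram sets
    sa, sb = bigrams(a), bigrams(b)
    return len(sa & sb) / len(sa | sb)

print(jaccard("rays hell burger too", "rays hell burgers"))  # 0.75
```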
In my case, I did the matching in two steps, first with comparing the parsed names for equality and then using the Jaccard index for the sets of bigrams on the remaining. Rather than actually calculating all the Jaccard index values for all pairs of names, I first put a bound on the maximum possible value for the Jaccard index for two sets of given size, and only computed the Jaccard index if that upper bound was high enough to potentially be useful. Most of the name pairs were still dissimilar enough that they weren't matches, but it dramatically reduced the number of comparisons made.
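The bound follows from the set sizes alone: the intersection can be no larger than the smaller set, and the union no smaller than the larger one, so J(A, B) <= min(|A|, |B|) / max(|A|, |B|). A sketch of that pre-filter (the threshold is illustrative):

```python
def max_possible_jaccard(len_a, len_b):
    # Upper bound on the Jaccard index from the set sizes alone
    return min(len_a, len_b) / max(len_a, len_b)

THRESHOLD = 0.8  # illustrative cutoff

def candidate_pairs(bigram_sets):
    # bigram_sets: dict mapping each parsed name to its set of bigrams
    names = list(bigram_sets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sa, sb = bigram_sets[a], bigram_sets[b]
            if max_possible_jaccard(len(sa), len(sb)) < THRESHOLD:
                continue  # cannot possibly match; skip the full computation
            if len(sa & sb) / len(sa | sb) >= THRESHOLD:
                yield a, b
```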
Filtering is the use of auxiliary data to reject false positives from the parsing and matching stages. A simple version would be to see if matching names correspond to businesses in different cities, and thus different businesses. That example could be applied before matching, as a kind of pre-filtering. More complicated or time-consuming checks might be applied afterwards.
I didn't do much filtering. I checked the countries for the firms to see if they were the same, and that was it. There weren't really that many possibilities in the data, some time constraints ruled out any extensive search for additional data to augment the filtering, and manual checking was planned anyway.
I'd like to add some examples to the excellent accepted answer. Tested in Python 2.7.
Parsing
Let's use an odd name as an example. The sketches below use the made-up "Hello World Café, LLC.".
We can start with removing legal control terms (here LLC). To do that, there is an awesome cleanco Python library, which does exactly that:
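A sketch using a recent cleanco (the library's API has changed across versions; the sketches below are written for Python 3):

```python
from cleanco import basename  # pip install cleanco

name = "Hello World Café, LLC."  # made-up example
name = basename(name)  # roughly "Hello World Café"
```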
Remove all punctuation:
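For example:

```python
import string

name = "Hello World Café"
# Strip ASCII punctuation (Unicode punctuation is handled below)
name = name.translate(str.maketrans("", "", string.punctuation))
```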
(for unicode strings, the following code works instead (source, regex):
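A sketch with the third-party regex package, which supports Unicode property classes:

```python
import regex  # pip install regex; this is not the stdlib "re"

name = "Hello World Café"
# \p{P} matches any Unicode punctuation character
name = regex.sub(r"\p{P}+", "", name)
```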
Split the name into tokens using NLTK:
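For example:

```python
import nltk
# nltk.download("punkt")  # one-time download of the tokenizer models

tokens = nltk.word_tokenize("Hello World Café")  # ['Hello', 'World', 'Café']
```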
Lowercase all tokens:
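Continuing from the previous snippet:

```python
tokens = [t.lower() for t in tokens]  # ['hello', 'world', 'café']
```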
Remove stop words. Note that this might cause problems with companies like "On Mars", which would be incorrectly matched to "Mars", because "On" is a stop word. I don't cover accented and special characters here (improvements welcome).
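For example, with NLTK's English stop word list:

```python
from nltk.corpus import stopwords
# nltk.download("stopwords")  # one-time download

stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop]
```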
Matching
Now that we have mapped all company names to tokens, we want to find the matching pairs. Arguably, Jaccard (or Jaro-Winkler) similarity is better than Levenshtein for this task, but it is still not good enough. The reason is that it does not take into account the importance of words in the name (the way TF-IDF does). So common words like "Company" influence the score just as much as words that might uniquely identify a company.
To improve on that, you can use a name similarity trick suggested in this awesome series of posts (not mine). Here is a code example along those lines:
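A rough sketch of the idea (my paraphrase, not the posts' exact code): weight each token by how rare it is across all names, so shared rare tokens count for much more than shared common ones. Here tokens2frequency maps each token to the number of names containing it:

```python
import math

def uniqueness(token, tokens2frequency, n_names):
    # Rare tokens get a weight near 1, ubiquitous ones near 0
    count = tokens2frequency.get(token, 1)
    return 1.0 - math.log(count) / math.log(max(n_names, 2))

def name_similarity(tokens_a, tokens_b, tokens2frequency, n_names):
    # A Jaccard-style score where each token counts by its uniqueness
    shared = set(tokens_a) & set(tokens_b)
    union = set(tokens_a) | set(tokens_b)
    denom = sum(uniqueness(t, tokens2frequency, n_names) for t in union)
    if denom == 0.0:
        return 0.0
    return sum(uniqueness(t, tokens2frequency, n_names) for t in shared) / denom
```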
Using that, you can match names with similarity exceeding a certain threshold. As a more complex approach, you can also take several scores (say, this uniqueness score, Jaccard, and Jaro-Winkler) and train a binary classification model on some labeled data, which will, given a number of scores, output whether the candidate pair is a match or not. More on this can be found in the same blog posts.
There is a great library for searching for similar/fuzzy strings in Python: fuzzywuzzy. It's a nice wrapper library on top of the Levenshtein distance measure mentioned above.
Here is how your names could be analysed:
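A sketch with fuzzywuzzy's standard scorers (the printed scores will depend on your version):

```python
from fuzzywuzzy import fuzz  # pip install fuzzywuzzy

pairs = [
    ("George Washington Middle Schl", "George Washington School"),
    ("Chop't Creative Salad Co", "Chop't Creative Salad Company"),
    ("Manny and Olga's Pizza", "Manny's & Olga's Pizza"),
]

for a, b in pairs:
    # token_sort_ratio ignores word order and punctuation differences
    print(a, "|", b, "->", fuzz.token_sort_ratio(a, b))
```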
Another way of solving such kind of problems could be Elasticsearch, which also supports fuzzy searches.
You could use the Levenshtein distance, which could be used to measure the difference between two sequences (basically an edit distance).
Levenshtein Distance in Python
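For reference, a minimal dynamic-programming implementation (the standard textbook version, not code from the linked page):

```python
def levenshtein(a, b):
    # previous[j] holds the edit distance between a[:i-1] and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

print(levenshtein("Ray's Hell Burger Too", "Ray's Hell Burgers"))  # 4
```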
This is a bit of an update to Dennis's answer. That answer was really helpful, as were the links he posted, but I couldn't get them to work right away. After trying the fuzzywuzzy search, I found this approach gave me a much better set of answers. I have a large list of merchants and I just want to group them together. Eventually I'll have a table I can use to try some machine learning to play around with, but for now this takes a lot of the effort out of it.
I only had to update his code a little bit and add a function to create the tokens2frequency dictionary. The original article didn't have that, and the functions didn't reference it correctly.
This will give you a final filtered list of names and the other names they match up to. It's the same basic code as the other post, just with a couple of missing pieces filled in. It also runs in Python 3.8.
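A sketch of the missing piece, which plugs into the name_similarity/uniqueness sketch shown earlier (the toy input is hypothetical):

```python
def tokens2frequency(token_lists):
    # Count, for each token, how many names it appears in
    freq = {}
    for tokens in token_lists:
        for t in set(tokens):
            freq[t] = freq.get(t, 0) + 1
    return freq

freq = tokens2frequency([["georges", "pizza"], ["georges", "burgers"]])
print(freq)  # {'georges': 2, 'pizza': 1, 'burgers': 1}
```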
I searched for "python edit distance" and this library came as the first result: http://www.mindrot.org/projects/py-editdist/
Another Python library that does the same job is here: http://pypi.python.org/pypi/python-Levenshtein/
An edit distance represents the amount of work you need to carry out to convert one string to another by following only simple -- usually, character-based -- edit operations. Every operation (substitution, deletion, insertion; sometimes transposition) has an associated cost, and the minimum edit distance between two strings is a measure of how dissimilar the two are.
In your particular case you may want to order the strings so that you find the distance to go from the longer to the shorter and penalize character deletions less (because I see that in many cases one of the strings is almost a substring of the other). So deletion shouldn't be penalized a lot.
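A sketch of that idea, with an adjustable deletion cost (the 0.3 weight is just an illustration):

```python
def weighted_edit_distance(longer, shorter, del_cost=0.3):
    # Standard Levenshtein DP, but deleting a character from the longer
    # string is cheaper than insertion or substitution
    prev = [float(j) for j in range(len(shorter) + 1)]
    for i in range(1, len(longer) + 1):
        cur = [prev[0] + del_cost]
        for j in range(1, len(shorter) + 1):
            cur.append(min(
                prev[j] + del_cost,                            # delete from longer
                cur[j - 1] + 1.0,                              # insert into longer
                prev[j - 1] + (longer[i-1] != shorter[j-1]),   # substitute
            ))
        prev = cur
    return prev[-1]

a, b = "Olney Theatre Center for the Arts", "Olney Theatre"
print(weighted_edit_distance(max(a, b, key=len), min(a, b, key=len)))
# 6.0 (20 extra characters deleted at 0.3 each)
```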
You could also make use of this sample code: http://norvig.com/spell-correct.html
Consider using the Diff-Match-Patch library. You'd be interested in the Diff process - applying a diff on your text can give you a good idea of the differences, along with a programmatic representation of them.
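A small sketch with the Python port of the library:

```python
from diff_match_patch import diff_match_patch  # pip install diff-match-patch

dmp = diff_match_patch()
diffs = dmp.diff_main("Holiday Inn Hotel Washington",
                      "Holiday Inn Washington-Georgetown")
dmp.diff_cleanupSemantic(diffs)  # merge trivial diffs into readable chunks
# Each entry is (op, text): op is -1 for delete, 0 for equal, 1 for insert
print(diffs)
```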
What you can do is separate the words by whitespace, commas, etc., then count the number of words one name has in common with another, and require a threshold number of words before the pair is considered "similar".
The other way is to do the same thing, but take the words and split them into characters. Then, for each word, you check whether the letters are found in the same order (from both sides) for x characters (or a percentage); if so, you can say that the word is similar too.
Ex: You have sqre and square.
Then you check by characters and find that the letters of sqre all appear in square in the same order, so it's a similar word.
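A sketch of that order-preserving check (a plain subsequence test, assuming the shorter word is the abbreviated one):

```python
def is_subsequence(short, long):
    # True if all letters of `short` appear in `long` in the same order
    it = iter(long)
    return all(c in it for c in short)

print(is_subsequence("sqre", "square"))  # True
print(is_subsequence("sqre", "circle"))  # False
```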
The algorithms that are based on the Levenshtein distance are good (not perfect), but their main disadvantage is that each comparison is very slow, and you would have to compare every possible combination.
Another way of working out the problem would be to use embeddings or bag of words to transform each company name (after some cleaning and preprocessing) into a vector of numbers, and after that apply an unsupervised or supervised ML method, depending on what is available.
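A minimal sketch of the bag-of-words route using scikit-learn (character n-grams work well for short names; the parameters are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = ["El Sol", "El Sol de America", "Olney Theatre",
         "Olney Theatre Center for the Arts"]

# Character 2- and 3-grams within word boundaries, TF-IDF weighted
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
vectors = vectorizer.fit_transform(names)

# Pairwise cosine similarities; high off-diagonal values are candidates
print(cosine_similarity(vectors).round(2))
```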
I created matchkraft (https://github.com/MatchKraft/matchkraft-python). It works on top of fuzzy-wuzzy and you can fuzzy match company names in one list.
It is very easy to use. Here is an example in Python: