如何将 MYSQL 中的公司名称与 PHP 进行模糊匹配以实现自动完成?
我的用户将通过剪切和粘贴导入包含公司名称的大字符串。
我有一个现有且不断增长的公司名称 MYSQL 数据库,每个数据库都有一个唯一的 company_id。
我希望能够解析字符串并为每个用户输入的公司名称分配一个模糊匹配。
现在,仅仅进行直接的字符串匹配也很慢。 ** Soundex 索引会更快吗?如何在用户打字时为他们提供一些选项? **
例如,有人写道:
Microsoft -> Microsoft Bare Essentials -> Bare Escentuals Polycom, Inc. -> Polycom
我发现以下线程似乎与此问题类似,但发布者尚未批准,我不确定他们的用例是否适用:
My users will import through cut and paste a large string that will contain company names.
I have an existing and growing MYSQL database of companies names, each with a unique company_id.
I want to be able to parse through the string and assign to each of the user-inputed company names a fuzzy match.
Right now, just doing a straight-up string match, is also slow. ** Will Soundex indexing be faster? How can I give the user some options as they are typing? **
For example, someone writes:
Microsoft -> Microsoft Bare Essentials -> Bare Escentuals Polycom, Inc. -> Polycom
I have found the following threads that seem similar to this question, but the poster has not approved and I'm not sure if their use-case is applicable:
How to find best fuzzy match for a string in a large string database
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
您可以从使用
SOUNDEX()
开始,这可能适用于您需要什么(我想象了一个自动建议框,其中包含用户正在键入的内容的现有替代方案)。SOUNDEX()
的缺点是:示例:
对于更高级的需求,我认为您需要查看在两个字符串的 Levenshtein 距离(也称为“编辑距离”)处并使用阈值。这是更复杂(=更慢)的解决方案,但它具有更大的灵活性。
主要缺点是,您需要两个字符串来计算它们之间的距离。使用 SOUNDEX,您可以将预先计算的 SOUNDEX 存储在表中,并对其进行比较/排序/分组/过滤。通过 Levenshtein 距离,您可能会发现“Microsoft”和“Nzcrosoft”之间的差异仅为 2,但需要更多时间才能得出该结果。
无论如何,MySQL 的 Levenshtein 距离函数示例可以在 codejanitor.com 中找到:Levenshtein Distance作为 MySQL 存储函数(2007 年 2 月 10 日)。
You can start with using
SOUNDEX()
, this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).The drawbacks of
SOUNDEX()
are:Example:
For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.
Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.
In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).
SOUNDEX 是一个不错的算法,但该主题最近取得了进展。另一种算法被创建,称为 Metaphone,后来被修改为 Double Metaphone 算法。我个人使用过双变音位的 java apache commons 实现,它是可定制的且准确的。
他们在维基百科页面上也有许多其他语言的实现。这个问题已经得到解答,但是如果您发现应用程序中出现 SOUNDEX 的任何已识别问题,很高兴知道还有其他选择。有时它可以为两个完全不同的单词生成相同的代码。双变音位的诞生就是为了帮助解决这个问题。
来自维基百科:http://en.wikipedia.org/wiki/Soundex
在双变音位页面的底部,他们有各种编程语言的实现: http://en.wikipedia.org/wiki/Double-Metaphone
Python 和MySQL 实现: https://github.com/AtomBoy/double-metaphone
SOUNDEX is an OK algorithm for this, but there have been recent advances on this topic. Another algorithm was created called the Metaphone, and it was later revised to a Double Metaphone algorithm. I have personally used the java apache commons implementation of double metaphone and it is customizable and accurate.
They have implementations in lots of other languages on the wikipedia page for it, too. This question has been answered, but should you find any of the identified problems with the SOUNDEX appearing in your application, it's nice to know there are options. Sometimes it can generate the same code for two really different words. Double metaphone was created to help take care of that problem.
From wikipedia: http://en.wikipedia.org/wiki/Soundex
At the bottom of the double metaphone page, they have the implementations of it for all kinds of programming languages: http://en.wikipedia.org/wiki/Double-Metaphone
Python & MySQL implementation: https://github.com/AtomBoy/double-metaphone
首先,我想补充一点,在使用任何形式的语音/模糊匹配算法时都应该非常小心,因为这种逻辑就是模糊或者更简单地说;可能不准确。当用于匹配公司名称时尤其如此。
一个好的方法是从其他数据中寻求佐证,例如地址信息、邮政编码、电话号码、地理坐标等。这将有助于确认您的数据准确匹配的概率。
与 B2B 数据匹配相关的一系列问题太多,无法在这里解决,我已经写了更多关于 我的博客中的公司名称匹配(也是 更新文章),但总的来说,关键问题是:
公司名称不一定是公司的开头
姓名。即“宝洁公司”或“美国联邦
保留 '
D&B 等。
他们的品牌并使自己与其他公司区分开来。
匹配精确数据很容易,但匹配非精确数据可能会花费更多时间,我建议您应该考虑如何验证非精确匹配,以确保这些数据具有可接受的质量。
Firstly, I would like to add that you should be very careful when using any form of Phonetic/Fuzzy Matching Algorithm, as this kind of logic is exactly that, Fuzzy or to put it more simply; potentially inaccurate. Especially true when used for matching company names.
A good approach is to seek corroboration from other data, such as address information, postal codes, tel numbers, Geo Coordinates etc. This will help confirm the probability of your data being accurately matched.
There are a whole range of issues related to B2B Data Matching too many to be addressed here, I have written more about Company Name Matching in my blog (also an updated article), but in summary the key issues are:
of a Company Name is not necessarily at the beginning of the Company
Name. i.e. ‘The Proctor and Gamble Company’ or ‘United States Federal
Reserve ‘
D&B etc..
their branding and to differentiate themselves from other companies.
Matching exact data is easy, but matching non-exact data can be much more time consuming and I would suggest that you should consider how you will be validating the non-exact matches to ensure these are of acceptable quality.
这是 mysql 和 php 中 soundex 函数的 php 讨论的链接。我将从那里开始,然后扩展到您其他不太明确的需求。
您的参考引用了 Levenshtein 匹配方法。两个问题。 1.它更适合测量两个已知单词之间的差异,而不是用于搜索。 2. 它讨论了一种解决方案,该解决方案旨在更多地检测校对错误(使用“Levenshtien”表示“Levenshtein”),而不是拼写错误(用户不知道如何拼写,说“Levenshtein”并输入“Levinstein”我通常将其与在书中查找短语而不是数据库中的关键值联系起来:
作为对评论的回应--
。使用用户的反馈循环。
Here's a link to the php discussion of the soundex functions in mysql and php. I'd start from there, then expand into your other not-so-well-defined requirements.
Your reference references the Levenshtein methodology for matching. Two problems. 1. It's more appropriate for measuring the difference between two known words, not for searching. 2. It discusses a solution designed more to detect things like proofing errors (using "Levenshtien" for "Levenshtein") rather than spelling errors (where the user doesn't know how to spell, say "Levenshtein" and types in "Levinstein". I usually associate it with looking for a phrase in a book rather than a key value in a database.
EDIT: In response to comment--
Test like mad and use the feedback loop from users.
虽然问题询问如何在 MySQL 中进行模糊搜索,但我建议考虑使用单独的模糊搜索(也称为拼写错误容忍)引擎来完成此任务。以下是一些需要考虑的搜索引擎:
Though the question asks about how to do fuzzy searches in MySQL, I'd recommend considering using a separate fuzzy search (aka typo tolerant) engine to accomplish this. Here are some search engines to consider:
此答案会导致使用 2 或 3 个或更多字符的输入对几乎所有实体进行索引查找。
基本上,创建一个包含 2 列(单词和键)的新表。对包含要模糊搜索的列的原始表运行一个过程。此过程将从原始列中提取每个单词,并将这些单词与原始键一起写入单词表。在此过程中,应丢弃诸如“the”、“and”等常见单词。
然后,我们在单词表上创建多个索引,如下所示...
第三到第六个字符的索引 + 键
或者,在单词列上创建 SOUNDEX() 索引。
一旦完成,我们将接受任何用户输入并使用普通的 word = input 或 LIKE input% 进行搜索。我们从不执行 LIKE % 输入,因为我们总是在前 3 个字符中寻找匹配项,这些字符都已编入索引。
如果您的原始表很大,您可以按字母表块对单词表进行分区,以确保用户的输入立即缩小到候选行。
This answer results in indexed lookup of almost any entity using input of 2 or 3 characters or more.
Basically, create a new table with 2 columns, word and key. Run a process on the original table containing the column to be fuzzy searched. This process will extract every individual word from the original column and write these words to the word table along with the original key. During this process, commonly occurring words like 'the','and', etc should be discarded.
We then create several indices on the word table, as follows...
An index on the 3rd through 6th character + key
Alternately, create a SOUNDEX() index on the word column.
Once this is in place, we take any user input and search using normal word = input or LIKE input%. We never do a LIKE %input as we are always looking for a match on any of the first 3 characters, which are all indexed.
If your original table is massive, you could partition the word table by chunks of the alphabet to ensure the user's input is being narrowed down to candidate rows immediately.
在服务器端使用受信任且经过良好测试的拼写检查库在查询之前检查拼写是否错误,然后对原始文本和第一个建议的正确拼写进行简单查询(如果拼写检查确定它是拼写错误)。
您可以为任何值得使用的拼写检查库创建自定义词典,您可能需要这样做才能匹配更晦涩的公司名称。
匹配两个简单字符串比对整个表进行编辑距离计算要快得多。 MySQL 不太适合这个。
我最近解决了一个类似的问题,并浪费了大量时间摆弄算法,所以我真的希望有更多的人警告不要在 MySQL 中这样做。
Check if it's spelled wrong before querying using a trusted and well tested spell checking library on the server side, then do a simple query for the original text AND the first suggested correct spelling (if spell check determined it was misspelled).
You can create custom dictionaries for any spell check library worth using, which you may need to do for matching more obscure company names.
It's way faster to match against two simple strings than it is to do a Levenshtein distance calculation against an entire table. MySQL is not well suited for this.
I tackled a similar problem recently and wasted a lot of time fiddling around with algorithms, so I really wish there had been more people out there cautioning against doing this in MySQL.
模糊匹配的最佳函数是levenshtein。它传统上由拼写检查器使用,所以这可能是正确的选择。这里有一个 UDF:http://joshdrew.com/
使用 levenshtein 的缺点是它会赢扩展性不太好。更好的想法可能是将整个表转储到拼写检查器自定义字典文件中,并从应用程序层而不是数据库层执行建议。
the best function for fuzzy matching is levenshtein. it's traditionally used by spell checkers, so that might be the way to go. there's a UDF for it available here: http://joshdrew.com/
the downside to using levenshtein is that it won't scale very well. a better idea might be to dump the whole table in to a spell checker custom dictionary file and do the suggestion from your application tier instead of the database tier.
之前可能有人建议过,但为什么不将数据转储到 Excel 并使用模糊匹配 Excel 插件。这将给出从 0 到 1 的分数(1 代表 100%)。
我对数据库中保存的业务合作伙伴(公司)数据执行了此操作。
下载最新的英国公司之家数据并根据该数据进行评分。
对于 ROW 数据,它更加复杂,因为我们必须执行更多的手动过程。
Probably been suggested before but why not dump the data out to Excel and use the Fuzzy Match Excel plugin. This will give a score from 0 to 1 (1 being 100%).
I did this for business partner (company) data that was held in a database.
Download the latest UK Companies House data and score against that.
For ROW data its more complex as we had to do a more manual process.