如何将 MYSQL 中的公司名称与 PHP 进行模糊匹配以实现自动完成?

发布于 2024-07-10 13:42:36 字数 677 浏览 11 评论 0原文

我的用户将通过剪切和粘贴导入包含公司名称的大字符串。

我有一个现有且不断增长的公司名称 MYSQL 数据库,每个数据库都有一个唯一的 company_id。

我希望能够解析字符串并为每个用户输入的公司名称分配一个模糊匹配。

现在,仅仅进行直接的字符串匹配也很慢。 ** Soundex 索引会更快吗? 如何在用户打字时为他们提供一些选项? **

例如,有人写道:

Microsoft       -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc.   -> Polycom

我发现以下线程似乎与此问题类似,但发布者尚未批准,我不确定他们的用例是否适用:

如何在大字符串中找到字符串的最佳模糊匹配数据库

在 Java 中匹配不精确的公司名称

My users will import through cut and paste a large string that will contain company names.

I have an existing and growing MYSQL database of companies names, each with a unique company_id.

I want to be able to parse through the string and assign to each of the user-inputed company names a fuzzy match.

Right now, just doing a straight-up string match, is also slow. ** Will Soundex indexing be faster? How can I give the user some options as they are typing? **

For example, someone writes:

Microsoft       -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc.   -> Polycom

I have found the following threads that seem similar to this question, but the poster has not approved and I'm not sure if their use-case is applicable:

How to find best fuzzy match for a string in a large string database

Matching inexact company names in Java

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

雅心素梦 2024-07-17 13:42:36

您可以从使用 SOUNDEX() 开始,这可能适用于您需要什么(我想象了一个自动建议框,其中包含用户正在键入的内容的现有替代方案)。

SOUNDEX() 的缺点是:

  • 它无法区分较长的字符串。 仅考虑前几个字符,末尾分歧的较长字符串会生成相同的 SOUNDEX 值,
  • 事实上第一个字母必须相同,否则您将无法轻松找到匹配项。 SQL Server 有 DIFFERENCE() 函数来告诉你两个 SOUNDEX 值相距多少,但我认为 MySQL 没有内置此类功能。
  • 对于 MySQL,至少根据 文档,SOUNDEX 对于 unicode 输入已损坏

示例:

SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')

/* all of these return 'M262' */

对于更高级的需求,我认为您需要查看在两个字符串的 Levenshtein 距离(也称为“编辑距离”)处并使用阈值。 这是更复杂(=更慢)的解决方案,但它具有更大的灵活性。

主要缺点是,您需要两个字符串来计算它们之间的距离。 使用 SOUNDEX,您可以将预先计算的 SOUNDEX 存储在表中,并对其进行比较/排序/分组/过滤。 通过 Levenshtein 距离,您可能会发现“Microsoft”和“Nzcrosoft”之间的差异仅为 2,但需要更多时间才能得出该结果。

无论如何,MySQL 的 Levenshtein 距离函数示例可以在 codejanitor.com 中找到:Levenshtein Distance作为 MySQL 存储函数(2007 年 2 月 10 日)

You can start with using SOUNDEX(), this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).

The drawbacks of SOUNDEX() are:

  • its inability to differentiate longer strings. Only the first few characters are taken into account, longer strings that diverge at the end generate the same SOUNDEX value
  • the fact the the first letter must be the same or you won't find a match easily. SQL Server has DIFFERENCE() function to tell you how much two SOUNDEX values are apart, but I think MySQL has nothing of that kind built in.
  • for MySQL, at least according to the docs, SOUNDEX is broken for unicode input

Example:

SELECT SOUNDEX('Microsoft')
SELECT SOUNDEX('Microsift')
SELECT SOUNDEX('Microsift Corporation')
SELECT SOUNDEX('Microsift Subsidary')

/* all of these return 'M262' */

For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.

Main drawback is, that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result.

In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).

场罚期间 2024-07-17 13:42:36

SOUNDEX 是一个不错的算法,但该主题最近取得了进展。 另一种算法被创建,称为 Metaphone,后来被修改为 Double Metaphone 算法。 我个人使用过双变音位的 java apache commons 实现,它是可定制的且准确的。

他们在维基百科页面上也有许多其他语言的实现。 这个问题已经得到解答,但是如果您发现应用程序中出现 SOUNDEX 的任何已识别问题,很高兴知道还有其他选择。 有时它可以为两个完全不同的单词生成相同的代码。 双变音位的诞生就是为了帮助解决这个问题。

从维基百科窃取:http://en.wikipedia.org/wiki/Soundex

作为对缺陷的回应
Soundex 算法,劳伦斯·飞利浦
开发了 Metaphone 算法
相同的目的。 后来飞利浦
对 Metaphone 进行了改进,
他将其称为“Double-Metaphone”。
双变音位包括很多
比其更大的编码规则集
前任,处理的子集
非拉丁字符,并返回
主要和次要编码
考虑到不同的发音
英语中的单个单词。

在双变音位页面的底部,他们有各种编程语言的实现:http ://en.wikipedia.org/wiki/Double-Metaphone

Python 和 MySQL 实现: https://github.com/AtomBoy/double-metaphone

SOUNDEX is an OK algorithm for this, but there have been recent advances on this topic. Another algorithm was created called the Metaphone, and it was later revised to a Double Metaphone algorithm. I have personally used the java apache commons implementation of double metaphone and it is customizable and accurate.

They have implementations in lots of other languages on the wikipedia page for it, too. This question has been answered, but should you find any of the identified problems with the SOUNDEX appearing in your application, it's nice to know there are options. Sometimes it can generate the same code for two really different words. Double metaphone was created to help take care of that problem.

Stolen from wikipedia: http://en.wikipedia.org/wiki/Soundex

As a response to deficiencies in the
Soundex algorithm, Lawrence Philips
developed the Metaphone algorithm for
the same purpose. Philips later
developed an improvement to Metaphone,
which he called Double-Metaphone.
Double-Metaphone includes a much
larger encoding rule set than its
predecessor, handles a subset of
non-Latin characters, and returns a
primary and a secondary encoding to
account for different pronunciations
of a single word in English.

At the bottom of the double metaphone page, they have the implementations of it for all kinds of programming languages: http://en.wikipedia.org/wiki/Double-Metaphone

Python & MySQL implementation: https://github.com/AtomBoy/double-metaphone

飘逸的'云 2024-07-17 13:42:36

首先,我想补充一点,在使用任何形式的语音/模糊匹配算法时都应该非常小心,因为这种逻辑就是模糊或者更简单地说; 可能不准确。 当用于匹配公司名称时尤其如此。

一个好的方法是从其他数据中寻求佐证,例如地址信息、邮政编码、电话号码、地理坐标等。这将有助于确认您的数据准确匹配的概率。

与 B2B 数据匹配相关的一系列问题太多,无法在这里解决,我已经写了更多关于 我的博客中的公司名称匹配(也是 更新文章),但总的来说,关键问题是:

  • 查看整个字符串是没有帮助的,因为最重要的部分
    公司名称不一定是公司的开头
    姓名。 即“宝洁公司”或“美国联邦
    保留 '
  • 缩写在公司名称中很常见,例如 HP、GM、GE、P&G、
    D&B 等。
  • 一些公司故意将其名称拼写错误,作为
    他们的品牌并使自己与其他公司区分开来。

匹配精确数据很容易,但匹配非精确数据可能会花费更多时间,我建议您应该考虑如何验证非精确匹配,以确保这些数据具有可接受的质量。

在我们建立 Match2Lists.com 之前,我们常常花费大量时间来验证模糊匹配。 在 Match2Lists 中,我们采用了强大的可视化工具,使我们能够审查非精确匹配,事实证明,这在匹配验证方面是真正的游戏规则改变者,降低了我们的成本,并使我们能够更快地提供结果。

祝你好运!

Firstly, I would like to add that you should be very careful when using any form of Phonetic/Fuzzy Matching Algorithm, as this kind of logic is exactly that, Fuzzy or to put it more simply; potentially inaccurate. Especially true when used for matching company names.

A good approach is to seek corroboration from other data, such as address information, postal codes, tel numbers, Geo Coordinates etc. This will help confirm the probability of your data being accurately matched.

There are a whole range of issues related to B2B Data Matching too many to be addressed here, I have written more about Company Name Matching in my blog (also an updated article), but in summary the key issues are:

  • Looking at the whole string is unhelpful as the most important part
    of a Company Name is not necessarily at the beginning of the Company
    Name. i.e. ‘The Proctor and Gamble Company’ or ‘United States Federal
    Reserve ‘
  • Abbreviations are common place in Company Names i.e. HP, GM, GE, P&G,
    D&B etc..
  • Some companies deliberately spell their names incorrectly as part of
    their branding and to differentiate themselves from other companies.

Matching exact data is easy, but matching non-exact data can be much more time consuming and I would suggest that you should consider how you will be validating the non-exact matches to ensure these are of acceptable quality.

Before we built Match2Lists.com, we used to spend an unhealthy amount of time validating fuzzy matches. In Match2Lists we incorporated a powerful Visualisation tool enabling us to review non-exact matches, this proved to be a real game changer in terms of match validation, reducing our costs and enabling us to deliver results much more quickly.

Best of Luck!!

┼── 2024-07-17 13:42:36

这是 mysql 和 php 中 soundex 函数的 php 讨论的链接。 我将从那里开始,然后扩展到您其他不太明确的需求。

您的参考引用了 Levenshtein 匹配方法。 两个问题。 1.它更适合测量两个已知单词之间的差异,而不是用于搜索。 2. 它讨论了一种解决方案,该解决方案旨在更多地检测校对错误(使用“Levenshtien”表示“Levenshtein”),而不是拼写错误(用户不知道如何拼写,说“Levenshtein”并输入“Levinstein”我通常将其与在书中查找短语而不是数据库中的关键值联系起来:

作为对评论的回应--

  1. 您至少可以让用户将公司名称放入多个文本框中吗?使用明确的名称分隔符(例如反斜杠); 3. 省略冠词(“The”)和通用缩写(或者您可以过滤这些); 4. 去掉空格并进行匹配(因此 Micro Soft => microsoft, Bare Essentials => bareessentials); 5. 过滤掉标点符号; 6. 对单词进行“OR”搜索(“bare”或“essentials”)——有时人们会不可避免地忽略其中一个

。使用用户的反馈循环。

Here's a link to the php discussion of the soundex functions in mysql and php. I'd start from there, then expand into your other not-so-well-defined requirements.

Your reference references the Levenshtein methodology for matching. Two problems. 1. It's more appropriate for measuring the difference between two known words, not for searching. 2. It discusses a solution designed more to detect things like proofing errors (using "Levenshtien" for "Levenshtein") rather than spelling errors (where the user doesn't know how to spell, say "Levenshtein" and types in "Levinstein". I usually associate it with looking for a phrase in a book rather than a key value in a database.

EDIT: In response to comment--

  1. Can you at least get the users to put the company names into multiple text boxes; 2. or use an unambigous name delimiter (say backslash); 3. leave out articles ("The") and generic abbreviations (or you can filter for these); 4. Squoosh the spaces out and match for that also (so Micro Soft => microsoft, Bare Essentials => bareessentials); 5. Filter out punctuation; 6. Do "OR" searches on words ("bare" OR "essentials") - people will inevitably leave one or the other out sometimes.

Test like mad and use the feedback loop from users.

淡看悲欢离合 2024-07-17 13:42:36

虽然问题询问如何在 MySQL 中进行模糊搜索,但我建议考虑使用单独的模糊搜索(也称为拼写错误容忍)引擎来完成此任务。 以下是一些需要考虑的搜索引擎:

  • ElasticSearch(开源,有大量功能,因此操作起来也很复杂)
  • Algolia(专有,但有很棒的文档,并且超级容易启动和运行)
  • Typesense(开源,提供与 Algolia 相同的模糊搜索功能)

Though the question asks about how to do fuzzy searches in MySQL, I'd recommend considering using a separate fuzzy search (aka typo tolerant) engine to accomplish this. Here are some search engines to consider:

  • ElasticSearch (Open source, has a ton of features, and so is also complex to operate)
  • Algolia (Proprietary, but has great docs and super easy to get up and running)
  • Typesense (Open source, provides the same fuzzy search-as-you-type feature as Algolia)
动听の歌 2024-07-17 13:42:36

此答案会导致使用 2 或 3 个或更多字符的输入对几乎所有实体进行索引查找。

基本上,创建一个包含 2 列(单词和键)的新表。 对包含要模糊搜索的列的原始表运行一个过程。 此过程将从原始列中提取每个单词,并将这些单词与原始键一起写入单词表。 在此过程中,应丢弃诸如“the”、“and”等常见单词。

然后,我们在单词表上创建多个索引,如下所示...

  • 单词上的普通小写索引 + 键
  • 第二到第五个字符的索引 + 键
  • 第三到第六个字符的索引 + 键

    或者,在单词列上创建 SOUNDEX() 索引。

一旦完成,我们将接受任何用户输入并使用普通的 word = input 或 LIKE input% 进行搜索。 我们从不执行 LIKE % 输入,因为我们总是在前 3 个字符中寻找匹配项,这些字符都已编入索引。

如果您的原始表很大,您可以按字母表块对单词表进行分区,以确保用户的输入立即缩小到候选行。

This answer results in indexed lookup of almost any entity using input of 2 or 3 characters or more.

Basically, create a new table with 2 columns, word and key. Run a process on the original table containing the column to be fuzzy searched. This process will extract every individual word from the original column and write these words to the word table along with the original key. During this process, commonly occurring words like 'the','and', etc should be discarded.

We then create several indices on the word table, as follows...

  • A normal, lowercase index on word + key
  • An index on the 2nd through 5th character + key
  • An index on the 3rd through 6th character + key

    Alternately, create a SOUNDEX() index on the word column.

Once this is in place, we take any user input and search using normal word = input or LIKE input%. We never do a LIKE %input as we are always looking for a match on any of the first 3 characters, which are all indexed.

If your original table is massive, you could partition the word table by chunks of the alphabet to ensure the user's input is being narrowed down to candidate rows immediately.

情感失落者 2024-07-17 13:42:36

在服务器端使用受信任且经过良好测试的拼写检查库查询之前检查拼写是否错误,然后对原始文本和第一个建议的正确拼写进行简单查询(如果拼写检查确定它是拼写错误)。

您可以为任何值得使用的拼写检查库创建自定义词典,您可能需要这样做才能匹配更晦涩的公司名称。

匹配两个简单字符串比对整个表进行编辑距离计算要快得多。 MySQL 不太适合这个。

我最近解决了一个类似的问题,并浪费了大量时间摆弄算法,所以我真的希望有更多的人警告不要在 MySQL 中这样做。

Check if it's spelled wrong before querying using a trusted and well tested spell checking library on the server side, then do a simple query for the original text AND the first suggested correct spelling (if spell check determined it was misspelled).

You can create custom dictionaries for any spell check library worth using, which you may need to do for matching more obscure company names.

It's way faster to match against two simple strings than it is to do a Levenshtein distance calculation against an entire table. MySQL is not well suited for this.

I tackled a similar problem recently and wasted a lot of time fiddling around with algorithms, so I really wish there had been more people out there cautioning against doing this in MySQL.

本王不退位尔等都是臣 2024-07-17 13:42:36

模糊匹配的最佳函数是levenshtein。 它传统上由拼写检查器使用,所以这可能是正确的选择。 这里有一个 UDF:http://joshdrew.com/

使用 levenshtein 的缺点是它会赢扩展性不太好。 更好的想法可能是将整个表转储到拼写检查器自定义字典文件中,并从应用程序层而不是数据库层执行建议。

the best function for fuzzy matching is levenshtein. it's traditionally used by spell checkers, so that might be the way to go. there's a UDF for it available here: http://joshdrew.com/

the downside to using levenshtein is that it won't scale very well. a better idea might be to dump the whole table in to a spell checker custom dictionary file and do the suggestion from your application tier instead of the database tier.

请止步禁区 2024-07-17 13:42:36

之前可能有人建议过,但为什么不将数据转储到 Excel 并使用模糊匹配 Excel 插件。 这将给出从 0 到 1 的分数(1 代表 100%)。

我对数据库中保存的业务合作伙伴(公司)数据执行了此操作。
下载最新的英国公司之家数据并根据该数据进行评分。

对于 ROW 数据,它更加复杂,因为我们必须执行更多的手动过程。

Probably been suggested before but why not dump the data out to Excel and use the Fuzzy Match Excel plugin. This will give a score from 0 to 1 (1 being 100%).

I did this for business partner (company) data that was held in a database.
Download the latest UK Companies House data and score against that.

For ROW data its more complex as we had to do a more manual process.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文