How can I fuzzy-match company names in MySQL with PHP for auto-complete?

Posted on 2024-09-06 20:12:22

My users will import, through cut and paste, a large string that contains company names.

I have an existing and growing MySQL database of company names, each with a unique company_id.

I want to be able to parse the string and assign a fuzzy match to each of the user-entered company names.

Right now, even a straight-up string match is slow. **Will Soundex indexing be faster? How can I give the user some options as they are typing?**

For example, someone writes:

Microsoft       -> Microsoft
Bare Essentials -> Bare Escentuals
Polycom, Inc.   -> Polycom

I have found the following threads that seem similar to this question, but the posters have not accepted an answer and I'm not sure whether their use case applies:

How to find best fuzzy match for a string in a large string database

Matching inexact company names in Java

茶花眉 2024-09-13 20:12:22

You can start by using SOUNDEX(); this will probably do for what you need (I picture an auto-suggestion box of already-existing alternatives for what the user is typing).

The drawbacks of SOUNDEX() are:

Its inability to differentiate longer strings. Only the first few characters are taken into account, so longer strings that diverge at the end generate the same SOUNDEX value.
The fact that the first letter must be the same or you won't find a match easily. SQL Server has a DIFFERENCE() function to tell you how far apart two SOUNDEX values are, but I think MySQL has nothing of that kind built in.
For MySQL, at least according to the docs, SOUNDEX is broken for Unicode input.

Example:

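SELECT SOUNDEX('Microsoft');
SELECT SOUNDEX('Microsift');
SELECT SOUNDEX('Microsift Corporation');
SELECT SOUNDEX('Microsift Subsidary');

/* a standard four-character Soundex returns 'M262' for all of these;
   MySQL's SOUNDEX() can return a longer string, so use
   LEFT(SOUNDEX(name), 4) to compare on the standard code */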
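
To make the first two drawbacks concrete, here is a minimal sketch of the standard four-character Soundex in Python. It is an illustration only, not MySQL's implementation (MySQL's SOUNDEX() follows a slightly different variant and does not truncate); the function name `soundex` and the `CODES` table are chosen for this example. PHP users get an equivalent from the built-in `soundex()` function.

```python
# Letter-to-digit table used by standard (American) Soundex.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the four-character Soundex code of `name` ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(c, "")  # vowels and unknown letters map to ""
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    # Pad with zeros, then keep only the first letter plus three digits.
    return (first.upper() + "".join(digits) + "000")[:4]


for name in ("Microsoft", "Microsift", "Microsift Corporation", "Nzcrosoft"):
    print(name, "->", soundex(name))
```

Everything after the third coded consonant is discarded, which is why the longer names collapse to the same value, and a typo in the first letter ("Nzcrosoft") changes the code outright.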

For more advanced needs, I think you need to look at the Levenshtein distance (also called "edit distance") of two strings and work with a threshold. This is the more complex (=slower) solution, but it allows for greater flexibility.

The main drawback is that you need both strings to calculate the distance between them. With SOUNDEX you can store a pre-calculated SOUNDEX value in your table and compare/sort/group/filter on that. With the Levenshtein distance, you might find that the difference between "Microsoft" and "Nzcrosoft" is only 2, but it will take a lot more time to come to that result, because every candidate row has to be compared against the input.

In any case, an example Levenshtein distance function for MySQL can be found at codejanitor.com: Levenshtein Distance as a MySQL Stored Function (Feb. 10th, 2007).
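
The threshold idea can be sketched outside the database as well. The following Python example is a minimal illustration, assuming the candidate names have already been fetched into a list; `levenshtein`, `closest`, and the `max_distance` parameter are names made up for this sketch, not part of the linked stored function. In PHP, the built-in `levenshtein()` function computes the same distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def closest(query: str, names: list[str], max_distance: int = 2):
    """Return (distance, name) pairs within the threshold, best first."""
    q = query.lower()
    scored = [(levenshtein(q, n.lower()), n) for n in names]
    return sorted(pair for pair in scored if pair[0] <= max_distance)


companies = ["Microsoft", "Bare Escentuals", "Polycom"]
print(closest("Nzcrosoft", companies))
print(closest("Bare Essentials", companies))
```

Each lookup compares the query against every candidate, which is the cost described above; a cheap pre-filter (for example on a stored phonetic code or a name prefix) is one way to shrink the candidate list before computing distances.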

本宫微胖 2024-09-13 20:12:22

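SOUNDEX is an OK algorithm for this, but there have been more recent advances on the topic. Another algorithm, called Metaphone, was created and later revised into the Double Metaphone algorithm. I have personally used the Java Apache Commons implementation of Double Metaphone, and it is customizable and accurate.

The Wikipedia page for it lists implementations in many other languages, too. This question has already been answered, but should any of the identified problems with SOUNDEX show up in your application, it's good to know there are other options. SOUNDEX can sometimes generate the same code for two completely different words. Double Metaphone was created to help with that problem.

From Wikipedia: http://en.wikipedia.org/wiki/Soundex

As a response to deficiencies in the Soundex algorithm, Lawrence Philips developed the Metaphone algorithm for the same purpose. Philips later developed an improvement to Metaphone, which he called Double-Metaphone. Double-Metaphone includes a much larger encoding rule set than its predecessor, handles a subset of non-Latin characters, and returns a primary and a secondary encoding to account for different pronunciations of a single word in English.

At the bottom of the Double Metaphone page, there are implementations for all kinds of programming languages: http://en.wikipedia.org/wiki/Double-Metaphone

Python and MySQL implementation: https://github.com/AtomBoy/double-metaphone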

久隐师 2024-09-13 20:12:22

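Firstly, I would like to add that you should be very careful when using any form of phonetic/fuzzy matching algorithm, as this kind of logic is exactly that: fuzzy, or to put it more simply, potentially inaccurate. This is especially true when it is used for matching company names.

A good approach is to seek corroboration from other data, such as address information, postal codes, telephone numbers, geo-coordinates, etc. This will help confirm the probability that your data has been matched accurately.

There is a whole range of issues related to B2B data matching, too many to address here. I have written more about company name matching on my blog (and in an updated article), but in summary the key issues are:

Looking at the whole string is unhelpful, as the most important part of a company name is not necessarily at the beginning of the name, e.g. 'The Procter & Gamble Company' or 'United States Federal Reserve'.
Abbreviations are commonplace in company names, e.g. HP, GM, GE, P&G, D&B, etc.
Some companies deliberately misspell their names as part of their branding and to differentiate themselves from other companies.

Matching exact data is easy, but matching non-exact data can be much more time-consuming, and I would suggest that you consider how you will validate the non-exact matches to ensure they are of acceptable quality.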

邮友 2024-09-13 20:12:22

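Here's a link to the PHP discussion of the soundex functions in MySQL and PHP. I'd start from there, then expand into your other not-so-well-defined requirements.

Your reference cites the Levenshtein method for matching. Two problems:

1. It's more appropriate for measuring the difference between two known words than for searching.
2. It discusses a solution designed more to detect proofing errors (typing "Levenshtien" for "Levenshtein") than spelling errors (where the user doesn't know how to spell, say, "Levenshtein" and types "Levinstein"). I usually associate it with looking for a phrase in a book rather than a key value in a database.

In response to the comment:

Can you at least:

1. get the users to put the company names into multiple text boxes;
2. or have them use an unambiguous name delimiter (say, a backslash);
3. leave out articles ("The") and generic abbreviations (or filter these out yourself);
4. squeeze the spaces out and match on that as well (so Micro Soft => microsoft, Bare Essentials => bareessentials);
5. filter out punctuation;
6. do "OR" searches on words ("bare" OR "essentials"), since people will inevitably leave one or the other out sometimes.

Use the feedback loop from users.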
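
The clean-up steps suggested in this answer can be sketched as a small normalization pass. This is a minimal Python illustration under stated assumptions: the `ARTICLES` and `SUFFIXES` word lists, the backslash delimiter, and the helper names `split_names`, `normalize`, and `or_match` are all hypothetical choices for this example and would need tuning against real data.

```python
import re

# Assumed word lists for illustration; extend them for real data.
ARTICLES = {"the"}
SUFFIXES = {"inc", "corp", "corporation", "co", "company", "ltd", "llc"}


def split_names(pasted: str, delimiter: str = "\\") -> list[str]:
    """Split a pasted string on an unambiguous delimiter (backslash here)."""
    return [part.strip() for part in pasted.split(delimiter) if part.strip()]


def normalize(name: str) -> tuple[str, list[str]]:
    """Return (squeezed key, remaining words) for a company name."""
    # Lowercase, drop punctuation, then split on whitespace.
    words = re.sub(r"[^\w\s]", "", name.lower()).split()
    kept = [w for w in words if w not in ARTICLES and w not in SUFFIXES]
    # The key has all spaces removed, so "Micro Soft" matches "Microsoft".
    return "".join(kept), kept


def or_match(query_words: list[str], candidate: str) -> bool:
    """True if any query word appears among the candidate's words."""
    return bool(set(query_words) & set(normalize(candidate)[1]))


pasted = "Micro Soft \\ The Bare Essentials \\ Polycom, Inc."
for raw in split_names(pasted):
    key, words = normalize(raw)
    print(raw, "->", key, words)
```

Storing the squeezed key in an indexed column alongside each company name would let the exact-match cases be resolved with a plain equality lookup, leaving the fuzzier techniques for whatever remains.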
