Super fuzzy name checking?

Posted on 2024-09-11 06:48:54

I'm working on some stuff for an in-house CRM. The company's current frontend allows lots of duplicates. I'm trying to stop end users from entering the same person twice because they searched for 'Bill Johnson' and not 'William Johnson'. So the user will enter some information about their new customer, and we'll find similar names (including fuzzy matches) already in our database and ask whether they meant one of those. Does such a database or technology exist?

Comments (6)

逆夏时光 2024-09-18 06:48:54

I implemented this kind of functionality on one website. I use double_metaphone() + levenshtein() in PHP. I precalculate a double_metaphone() for each entry in the database, which I look up using a SELECT on the first x chars of the 'metaphoned' search term.

Then I sort the returned results by their Levenshtein distance. double_metaphone() is not part of any PHP library (last time I checked), so I borrowed a PHP implementation I found on the net a long while ago (the site is no longer online). I should post it somewhere, I suppose.

EDIT: The website is still in archive.org:
http://web.archive.org/web/20080728063208/http://swoodbridge.com/DoubleMetaPhone/

or Google cache:
http://webcache.googleusercontent.com/search?q=cache:Tr9taWl9hMIJ:swoodbridge.com/DoubleMetaPhone/+Stephen+Woodbridge+double_metaphon

which leads to many other useful links with source code for double_metaphone(), including one in JavaScript on GitHub: http://github.com/maritz/js-double-metaphone

EDIT: I went through my old code, and here are roughly the steps I follow, written as pseudocode to keep it clear:

1) Precompute a double_metaphone() for every word in the database, i.e., $word='blahblah'; $soundslike=double_metaphone($word);

2) At lookup time, $word is fuzzy-searched against the database: $soundslike = double_metaphone($word)

3) SELECT * FROM table WHERE soundslike LIKE $soundslike (if you have levenshtein stored as a procedure, much better: SELECT * FROM table WHERE levenshtein(soundslike, $soundslike) < mythreshold ORDER BY levenshtein(word, $word) ASC LIMIT ... etc.)

It has worked well for me, although I can't use a stored procedure, since I have no control over the server and it's using MySQL 4.20 or something.
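A minimal sketch of the steps above, written in Python with an in-memory SQLite table rather than PHP and MySQL. `toy_phonetic_key` is a hypothetical, deliberately crude stand-in for double_metaphone() (it is not a real metaphone algorithm), and the table name `words`, the `prefix_len` parameter and the sample names are assumptions made for illustration.

```python
import sqlite3


def toy_phonetic_key(word: str) -> str:
    """Crude stand-in for double_metaphone(): keep the first letter, then
    drop vowels and H/W/Y and skip letters that repeat the previous one."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = [letters[0]]
    prev = letters[0]
    for c in letters[1:]:
        if c not in "AEIOUHWY" and c != prev:
            key.append(c)
        prev = c
    return "".join(key)


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), computed row by row."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,                # delete from a
                           row[j - 1] + 1,                 # insert into a
                           prev_row[j - 1] + (ca != cb)))  # substitute
        prev_row = row
    return prev_row[-1]


def build_index(names):
    """Step 1: precompute and store a phonetic key for every word."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (word TEXT, soundslike TEXT)")
    conn.executemany("INSERT INTO words VALUES (?, ?)",
                     [(n, toy_phonetic_key(n)) for n in names])
    return conn


def fuzzy_lookup(conn, word, prefix_len=3, limit=5):
    """Steps 2 and 3: select by key prefix, then sort by edit distance."""
    prefix = toy_phonetic_key(word)[:prefix_len]
    rows = conn.execute("SELECT word FROM words WHERE soundslike LIKE ?",
                        (prefix + "%",)).fetchall()
    candidates = [r[0] for r in rows]
    candidates.sort(key=lambda w: levenshtein(w.lower(), word.lower()))
    return candidates[:limit]


conn = build_index(["Johnson", "Jonson", "Johnston", "Jackson", "Smith"])
matches = fuzzy_lookup(conn, "Jonsen")
print(matches)  # ['Jonson', 'Johnson', 'Johnston']
```

Sorting happens in application code here, which mirrors the situation described above where a stored procedure is not available on the server.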

等待圉鍢 2024-09-18 06:48:54

I asked a similar question once: "Name Hypocorism List". I never did get around to doing anything with it, but the problem has come up again at work, so I might write and open-source a library in .NET for doing some matching.

Update:
I ported the Perl module I mentioned there to C# and put it up on GitHub: http://github.com/stimms/Nicknames

绳情 2024-09-18 06:48:54

Implement the Levenshtein distance:

http://en.wikipedia.org/wiki/Levenshtein_distance

This can be written as a SQL function and queried in many different ways.
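As one possible illustration, the sketch below registers a Python Levenshtein function as a SQL function in SQLite via `sqlite3.Connection.create_function` and uses it in a query. The `customers` table, the sample rows and the threshold of 3 are assumptions for the example; other databases would need their own stored function or extension.

```python
import sqlite3


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), computed row by row."""
    prev_row = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        row = [i]
        for j, cb in enumerate(b, start=1):
            row.append(min(prev_row[j] + 1,
                           row[j - 1] + 1,
                           prev_row[j - 1] + (ca != cb)))
        prev_row = row
    return prev_row[-1]


conn = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the name levenshtein(a, b).
conn.create_function("levenshtein", 2, levenshtein)

conn.execute("CREATE TABLE customers (name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?)",
                 [("William Johnson",), ("Wiliam Jonson",), ("Mary Smith",)])

# Keep rows within a (hypothetical) distance threshold, closest first.
rows = conn.execute(
    "SELECT name, levenshtein(name, :q) AS dist "
    "FROM customers WHERE levenshtein(name, :q) <= 3 ORDER BY dist",
    {"q": "William Johnson"},
).fetchall()
print(rows)  # [('William Johnson', 0), ('Wiliam Jonson', 2)]
```

A query like this computes the distance for every row, so on a large table it is usually combined with a cheaper prefilter.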

暗藏城府 2024-09-18 06:48:54

Well, SSIS has some fuzzy-logic tasks that we use to find duplicates after the fact.

I think, though, that for best results your logic needs to look at more than just the name. If users are entering address, email or phone information, you could look for people with the same last name and one or more of those other fields matching, and ask whether one of them is the person they mean. You could also build a table of nicknames for various names and match on that. You won't get all of them, but you could at least cover some of the most common ones in your country.
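A rough sketch of that idea in Python, under stated assumptions: the `NICKNAMES` table is a tiny hypothetical sample, and the `Customer` fields and the rule "same last name plus a matching first name, email or phone" are one illustrative interpretation, not a tested deduplication policy.

```python
from dataclasses import dataclass

# Hypothetical sample; a real nickname table would be far larger
# and specific to your region.
NICKNAMES = {
    "bill": "william", "will": "william",
    "bob": "robert", "rob": "robert",
    "liz": "elizabeth", "beth": "elizabeth",
}


@dataclass
class Customer:
    first: str
    last: str
    email: str = ""
    phone: str = ""


def canonical_first(name: str) -> str:
    """Map a nickname to its full form; unknown names pass through."""
    n = name.strip().lower()
    return NICKNAMES.get(n, n)


def digits(phone: str) -> str:
    """Compare phone numbers on digits only, ignoring punctuation."""
    return "".join(ch for ch in phone if ch.isdigit())


def likely_duplicates(new: Customer, existing: list) -> list:
    """Same last name, plus a matching first name, email or phone."""
    hits = []
    for c in existing:
        if c.last.strip().lower() != new.last.strip().lower():
            continue
        same_first = canonical_first(c.first) == canonical_first(new.first)
        same_email = bool(new.email) and (
            c.email.strip().lower() == new.email.strip().lower())
        same_phone = bool(digits(new.phone)) and (
            digits(c.phone) == digits(new.phone))
        if same_first or same_email or same_phone:
            hits.append(c)
    return hits


existing = [
    Customer("William", "Johnson", "wj@example.com", "555-0100"),
    Customer("Mary", "Johnson", "mary@example.com", "555-0199"),
    Customer("Bill", "Smith"),
]
by_name = likely_duplicates(Customer("Bill", "Johnson"), existing)
by_phone = likely_duplicates(Customer("W.", "Johnson", phone="(555) 0100"), existing)
```

Both lookups return only the William Johnson record, which the application could then show to the user as a "did you mean" suggestion.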

美人骨 2024-09-18 06:48:54

You can use SOUNDEX to find similar-sounding names. However, it won't match William and Bill, for example.

Try this in SQL as an example.

SELECT SOUNDEX('John'), SOUNDEX('Jon')
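
To show why John and Jon match while William and Bill do not, here is a small Python sketch of the classic American Soundex rules. It is an illustrative reimplementation, not the code of any particular database, and database SOUNDEX functions may differ in edge cases such as how H and W are treated.

```python
# Consonant groups and their Soundex digits; vowels, H, W and Y have no code.
CODES = {}
for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                     ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in group:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """First letter plus up to three digits, zero-padded to four characters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev_code = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev_code:
            result += code
        prev_code = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]


print(soundex("John"), soundex("Jon"))      # J500 J500
print(soundex("William"), soundex("Bill"))  # W450 B400
```

Because the code always starts with the first letter of the name, nickname pairs that begin with different letters can never share a Soundex code.
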
雪花飘飘的天空 2024-09-18 06:48:54

There is some built-in SOUNDS LIKE functionality in SQL Server, see SOUNDEX http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx

As for full-name/nickname searching, there isn't anything built in that I am aware of. Nicknames vary by region, and it's a lot of information to keep track of. There might be a database linking full names to nicknames that you could leverage in your own application.
