使用 T-SQL 进行模糊匹配
我有一个包含个人数据等的表人员。 有很多列,但这里感兴趣的是:addressindex
、lastname
和 firstname
,其中 addressindex
是唯一的地址一直延伸到公寓的门。 因此,如果我有“像下面这样”的两个人的姓氏
和一个名字
相同,那么他们很可能是重复的。
我需要一种方法来列出这些重复项。
tabledata:
personid 1
firstname "Carl"
lastname "Anderson"
addressindex 1
personid 2
firstname "Carl Peter"
lastname "Anderson"
addressindex 1
如果我要在所有列上完全匹配,我知道该怎么做,但我需要模糊匹配来完成这个技巧(从上面的示例),结果如下:
Row personid addressindex lastname firstname
1 2 1 Anderson Carl Peter
2 1 1 Anderson Carl
.....
有关如何以良好的方式解决此问题的任何提示?
I have a table Persons with personaldata and so on. There are lots of columns but the once of interest here are: addressindex
, lastname
and firstname
where addressindex
is a unique address drilled down to the door of the apartment.
So if I have 'like below' two persons with the lastname
and one the firstnames
are the same they are most likely duplicates.
I need a way to list these duplicates.
tabledata:
personid 1
firstname "Carl"
lastname "Anderson"
addressindex 1
personid 2
firstname "Carl Peter"
lastname "Anderson"
addressindex 1
I know how do this if I were to match exactly on all columns but I need fuzzy match to do the trick with (from the above example) a result like:
Row personid addressindex lastname firstname
1 2 1 Anderson Carl Peter
2 1 1 Anderson Carl
.....
Any hints on how to solve this in a good way?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(10)
自主数据服务第一个版本以来,您可以访问比 SOUNDEX 实现的更高级的模糊逻辑算法。 因此,只要您安装了 MDS,您就可以在 mdq 架构(MDS 数据库)中找到名为 Similarity() 的函数。
有关其工作原理的更多信息: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/
Since the first release of Master Data Services, you've got access to more advanced fuzzy logic algorithms than what SOUNDEX implements. So provided that you've got MDS installed, you'll be able to find a function called Similarity() in the mdq schema (MDS database).
More info on how it works: http://blog.hoegaerden.be/2011/02/05/finding-similar-strings-with-fuzzy-logic-functions-built-into-mds/
您可以使用 SQL Server 中的 SOUNDEX 和相关 DIFFERENCE 函数来查找相似的名称。 MSDN 上的参考位于此处。
You can use the SOUNDEX and related DIFFERENCE function in SQL Server to find similar names. The reference on MSDN is here.
关于去重复的事情,你的字符串分割和匹配是很好的第一次切割。 如果有关于数据的已知项目可以用来减少工作量和/或产生更好的结果,那么利用它们总是好的。 请记住,对于重复数据删除来说,通常不可能完全消除手动工作,尽管您可以通过自动捕获尽可能多的数据然后生成“不确定性案例”的报告来使手动工作变得更加容易。
关于名称匹配:SOUNDEX 对于匹配质量来说很糟糕,对于您尝试做的工作类型尤其糟糕,因为它会匹配距离目标太远的东西。 最好使用双变音位结果和编辑距离的组合来执行名称匹配。 通过适当的偏置,这种方法效果非常好,并且可以在对已知信息进行清理后用于第二次传递。
您可能还需要考虑使用 SSIS 包并研究模糊查找和分组转换 (http://msdn.microsoft.com/en-us/library/ms345128(SQL.90).aspx)。
也可以使用 SQL 全文搜索 (http://msdn.microsoft.com/en-us/library/cc879300.aspx),但可能不适合您的特定问题域。
Regarding de-duping things your string split and match is great first cut. If there are known items about the data that can be leveraged to reduce workload and/or produce better results, it is always good to take advantage of them. Bear in mind that often for de-duping it is impossible to entirely eliminate manual work, although you can make that much easier by catching as much as you can automatically and then generating reports of your "uncertainty cases."
Regarding name matching: SOUNDEX is horrible for quality of matching and especially bad for the type of work you are trying to do as it will match things that are too far from the target. It's better to use a combination of double metaphone results and the Levenshtein distance to perform name matching. With appropriate biasing this works really well and could probably be used for a second pass after doing a cleanup on your knowns.
You may also want to consider using an SSIS package and looking into Fuzzy Lookup and Grouping transformations (http://msdn.microsoft.com/en-us/library/ms345128(SQL.90).aspx).
Using SQL Full-Text Search (http://msdn.microsoft.com/en-us/library/cc879300.aspx) is a possibility as well, but is likely not appropriate to your specific problem domain.
这样做
,然后创建一个触发器
现在我能做的是我可以创建一个看起来像这样的过程,
这将返回与为特定 personid 提供的名称几乎匹配的所有名称
do it this way
and later on create a trigger
now what i can do is i can create a procedure which looks something like this
this will return you all the names that are nearly in match with the names provided by for a particular personid
我个人使用 Jaro-Winkler 算法的 CLR 实现,该算法似乎工作得很好- 它在处理长度超过 15 个字符的字符串时有点困难,并且不喜欢匹配电子邮件地址,但其他方面都很好 - 可以找到完整的实施指南 这里
这是 t-sql 中的示例。
与 soundex 相比,这种方法的优点在于,它可以更多轻松处理拼写错误,因为如果拼写错误产生不同的声音,那么 SOUNDEX 将无法正确匹配它。
例如,如果您输入“ytpe”而不是“type”,那么 SOUNDEX 将不会为您提供正确的匹配项。 SOUNDEX('ytpe') 和 SOUNDEX('type') 返回的代码如下:
ytpe Y310
type T100
如果我使用 Jaro-Winkler StringDistance('ytpe','type') 这是我得到的 0.91(6) ,这意味着这是一个很好的匹配。 请记住 1 – 是完全匹配,因此 0.91 非常接近它。
如果您出于某种原因无法使用 CLR 函数,也许您可以尝试通过 SSIS 包运行数据(使用模糊转换查找) - 详细 此处
I personally use a CLR implementation of the Jaro-Winkler algorithm which seems to work pretty well - it struggles a bit with strings longer than about 15 characters and doesn't like matching email addresses but otherwise is quite good - full implementation guide can be found here
Here is a sample of what that looks like in t-sql
An advantage of this approach over soundex is that this can more easily handle typos, because if the typo produces different sound then SOUNDEX will not match it correctly.
For example, if you typed “ytpe” instead of “type” then SOUNDEX will not give you a correct match. The codes returned for SOUNDEX(‘ytpe’) and SOUNDEX(‘type’) are the following:
ytpe Y310
type T100
If I use Jaro-Winkler StringDistance(‘ytpe’,’type’) here is what I get 0.91(6), which means it is a good match. Remember 1 – is exact match, so 0.91 is very close to it.
If you are unable to use CLR functions for whatever reasons, maybe you could try running the data through an SSIS package (using the fuzzy transformation lookup) - detailed here
代码的第 2 部分 对于 Redfilter 答案中的链接:
参考文献:
https://github.com/mb16/geocoderNet/blob/ master/build/sql/doubleMetaphone.sql
Part 2 of the code For the link in Redfilter Answer:
References:
https://github.com/mb16/geocoderNet/blob/master/build/sql/doubleMetaphone.sql
分两部分粘贴RedFilter代码,以避免链接失效
参考文献:
https://github.com/mb16/geocoderNet/blob/ master/build/sql/doubleMetaphone.sql
第 1 部分:
Pasting RedFilter code in two parts ,so as to avoid link rot
References:
https://github.com/mb16/geocoderNet/blob/master/build/sql/doubleMetaphone.sql
Part1:
我会使用 SQL Server 全文索引,它允许您进行搜索并返回不仅包含该单词而且可能存在拼写错误的内容。
I would use SQL Server Full Text Indexing, which will allow you to do searches and return things that not only contain the word but also may have a misspelling.
除了这里的其他好信息之外,您可能还需要考虑使用 Double Metaphone 语音算法通常被认为比 SOUNDEX 更好。
Tim Pfeiffer 在他的文章 Double Metaphone 听起来很棒转换 C++ 中详细介绍了 SQL 中的实现Double Metaphone 算法到 T-SQL(最初位于 SQL Mag & 然后在 SQL Server Pro 的存档)。
这将有助于匹配具有轻微拼写错误的姓名,例如 Carl 与 Karl。
更新:实际的可下载代码似乎已经消失,但是 这是在 github 存储库上找到的实现,它似乎克隆了原始代码
In addition to the other good info here, you might want to consider using the Double Metaphone phonetic algorithm which is generally considered to be better than SOUNDEX.
Tim Pfeiffer details an implementation in SQL in his article Double Metaphone Sounds Great Convert the C++ Double Metaphone algorithm to T-SQL (originally in SQL Mag & then in SQL Server Pro).
That will assist in matching names with slight misspellings, e.g., Carl vs. Karl.
Update: The actual downloadable code seems to be gone, but here's an implementation found on a github repo that appears to have cloned the original code
我发现 SQL Server 提供的用于模糊匹配的东西非常笨重。 我使用 Levenshtein 距离算法和一些加权来使用自己的 CLR 函数,运气非常好。 然后,我使用该算法创建了一个名为 GetSimilarityScore 的 UDF,它接受两个字符串并返回 0.0 到 1.0 之间的分数。 匹配越接近 1.0 越好。 然后,以>=0.8左右的阈值进行查询以获得最有可能的匹配。 像这样的事情:
只是不要用非常大的桌子来做。 这是一个缓慢的过程。
这是 CLR UDF:
I've found that the stuff SQL Server gives you to do fuzzy matching is pretty clunky. I've had really good luck with my own CLR functions using the Levenshtein distance algorithm and some weighting. Using that algorithm, I've then made a UDF called GetSimilarityScore that takes two strings and returns a score between 0.0 and 1.0. The closer to 1.0 the match is, the better. Then, query with a threshold of >=0.8 or so to get the most likely matches. Something like this:
Just don't do it with really large tables. It's a slow process.
Here's the CLR UDFs: