我想从数据库中过滤掉重复的客户名称。一位客户可能有多个同名但拼写差异不大的系统条目。这是一个示例:名为 Brook 的客户可能有 3 个系统条目
具有以下变体:
- Brook Berta
- Bruck Berta
- Biruk Berta
假设我们将此名称放入一个数据库列中。
我想知道识别 100,000 条记录中此类重复的不同机制。我们可以使用 C# 中的正则表达式来迭代所有记录或其他一些模式匹配技术,或者我们可以将这些记录导出到最适合此类查询的记录(具有正则表达式功能的 SQL)。
这就是我认为的解决方案
- 编写 C# 代码来迭代每个记录
- 仅按顺序获取辅音字母(在上面的情况下:BrKBrt)
- 考虑到从其他记录中搜索相同的辅音模式
类似发音的字母,如 (C,K) (C,S), (F, PH)
所以请转发任何想法。
I want to filter out duplicate customer names from a database. A single customer may have more than one entry to the system with the same name but with little difference in spelling. So here is an example: A customer named Brook may have three entries to the system
with this variations:
- Brook Berta
- Bruck Berta
- Biruk Berta
Let's assume we are putting this name in one database column.
I would like to know the different mechanisms to identify such duplications form say a 100,000 records. We may use regular expressions in C# to iterate through all records or some other pattern matching technique or we may export these records to what ever best fits for such queries (SQL with Regular Expression capabilities)).
This is what I thought as a solution
- Write a C# code to iterate through each record
- Get only the Consonant letters in order (in the above case: BrKBrt)
- Search for the same Consonant pattern from the other records considering
similar sounding letters like (C,K) (C,S), (F, PH)
So please forward any ideas.
发布评论
评论(8)
Double Metaphone 算法于 2000 年发布,是 Soundex 算法的一个新的改进版本,于 1918 年获得专利。
本文提供了多种语言的 Double Metaphone 实现的链接。
The Double Metaphone algorithm, published in 2000, is a new and improved version of the Soundex algorithm that was patented in 1918.
The article has links to Double Metaphone implementations in many languages.
看看 Soundex
Transact-SQL 中有一个 Soundex 函数(请参阅 http://msdn.microsoft.com/en-us/library/ms187384.aspx):
为每个示例值返回相同的值 B620
Have a look at Soundex
There is a Soundex function in Transact-SQL (see http://msdn.microsoft.com/en-us/library/ms187384.aspx):
returns the same value B620 for each of the example values
用于查找字符串相似性的明显的、已建立的(且有据可查的)算法是:
The obvious, established (and well documented) algorithms for finding string similarity are:
我会考虑写一些诸如“著名的”Python 拼写检查器之类的东西。
http://norvig.com/spell- Correct.html
这将需要一个词并找到基于缺失字母、添加字母、交换字母等的所有可能的替代方案。
I would consider writing something such as the "famous" python spell checker.
http://norvig.com/spell-correct.html
This will take a word and find all possible alternatives based on missing letters, adding letters, swapping letters, etc.
您可能想在 Google 上搜索
语音相似度算法
,您会发现大量相关信息。包括 Codeproject 上的这篇关于在 C# 中实现解决方案的文章。You might want to google for
phonetic similarity algorithm
and you'll find plenty of information about this. Including this article on Codeproject about implementing a solution in C#.看看 soundex。它是大多数语言中相当标准的库,可以满足您的要求,即通过算法识别语音相似性。
http://en.wikipedia.org/wiki/Soundex
Look into soundex. It's a pretty standard library in most languages that does what you require, i.e. algorithmically identify phonetic similarity.
http://en.wikipedia.org/wiki/Soundex
有一个非常好的 R(只需在 Google 中搜索“R”)包用于 Record Linkage。标准示例完全针对您的问题: R RecordLinkage
C -Soundex 等的代码直接取自 PostgreSQL!
There is a very nice R (just search for "R" in Google) package for Record Linkage. The standard examples target exactly your problem: R RecordLinkage
The C-Code for Soundex etc. is taken directly from PostgreSQL!
对于此解决方案,我会推荐 Soundex 和 Lev 距离的派生算法。恕我直言,莱文斯坦距离更适合拼写检查解决方案。
I would recommend Soundex and derived algorithms over Lev distance for this solution. Levenstein distance more appropriate for spell checking solutions imho.