我在 SQL Server 2008 中有 Persons
表。
我的目标是找到地址几乎相似的人。
地址由 state
、town
、street
、house
、apartment
列描述、邮政编码
和电话
。
由于某些州(非美国)的一些具体差异和人为因素(地址错误等),地址不会以相同的模式填写。
地址中最常见的错误
- 区分大小写
- 有人写“apt.”,另一个写“apartment”或“ap.”。 (尽管地址不是用英语书写的)
- 空格、点、逗号
- 书写街道名称的差异,例如“Dr.”琼斯街”或“琼斯博士街”或“D.乔恩. st.”或“Dr Jones st”等。
主要问题是数据的模式不同,因此很难找到相似的地址。
有没有针对此类问题的算法?
先谢谢了。
更新
- 正如我提到的,地址被分成不同的列,我应该生成一个字符串连接列还是为每列执行您的步骤?
我认为我不应该连接列,但是如果我要单独比较列,我应该如何组织它?我是否应该找到每一列的相似之处,将它们联合或相交或其他什么?
- 我应该收集一些统计数据或某种教育算法吗?
I have Persons
table in SQL Server 2008.
My goal is to find Persons who have almost similar addresses.
The address is described with columns state
, town
, street
, house
, apartment
, postcode
and phone
.
Due to some specific differences in some states (not US) and human factor (mistakes in addresses etc.), address is not filled in the same pattern.
Most common mistakes in addresses
- Case sensitivity
- Someone wrote "apt.", another one "apartment" or "ap." (although addresses aren't written in English)
- Spaces, dots, commas
- Differences in writing street names, like 'Dr. Jones str." or "Doctor Jones street" or "D. Jon. st." or "Dr Jones st" etc.
The main problem is that data isn't in the same pattern, so it's really difficult to find similar addresses.
Is there any algorithm for this kind of issue?
Thanks in advance.
UPDATE
- As I mentioned address is separated into different columns. Should I generate a string concatenating columns or do your steps for each column?
I assume I shouldn't concatenate columns, but if I'll compare columns separately how should I organize it? Should I find similarities for each column an union them or intersect or anything else?
- Should I have some statistics collecting or some kind of educating algorithm?
发布评论
评论(15)
建议这样处理:
从各个条目创建单词级 n-gram(trigram/4-gram 可以做到)
对字符串比较进行多 x 多比较,并按字符串距离对它们进行聚类。有人提出了 Levenshtein;对于此类任务,有更好的方法,Jaro-Winkler Distance 和 Smith-Waterman 效果更好。像 SimMetrics 这样的库会让生活变得更轻松
一旦有了 n 元语法簇,您就可以使用组成子元语法解析整个字符串,即 D.Jones St =>戴维·琼斯街=> DJones St.
不应该太难,这是一个太常见的问题。
更新:根据上面的更新,以下是建议的步骤
将列连接到单个字符串中,也许创建一个数据库“视图”。例如,
创建视图vwAddress
作为
选择前 10000 个
州镇、街道、房屋、公寓、邮政编码、
州+城镇+街道+房屋+公寓+邮政编码作为地址
from ...
编写一个单独的应用程序(例如使用 Java 或 C#/VB.NET)并使用 JaroWinkler 等算法来估计组合地址的字符串距离,以创建多 x 多比较。并写入一个单独的表中
地址1 |地址 n |相似性
您可以使用 Simmetrics 来获得相似度,如下所示:
您还可以对它进行三元组,这样“1 Jones Street, Sometown, SomeCountry”这样的地址就变成“1 Jones Street”、“Jones Street Sometown”等....
并比较卦象。 (甚至 4 克)以获得更高的准确度。
最后,您可以按相似度排序以获得最相似地址的集群并确定适当的阈值。不知道为什么你被卡住了
Suggest approaching it thus:
Create word-level n-grams (a trigram/4-gram might do it) from the various entries
Do a many x many comparison for string comparison and cluster them by string distance. Someone suggested Levenshtein; there are better ones for this kind of task, Jaro-Winkler Distance and Smith-Waterman work better. A libraryt such as SimMetrics would make life a lot easier
Once you have clusters of n-grams, you can resolve the whole string using the constituent subgrams i.e. D.Jones St => Davy Jones St. => DJones St.
Should not be too hard, this is an all-too-common problem.
Update: Based on your update above, here are the suggested steps
Catenate your columns into a single string, perhaps create a db "view" . For example,
create view vwAddress
as
select top 10000
state town, street, house, apartment, postcode,
state+ town+ street+ house+ apartment+ postcode as Address
from ...
Write a separate application (say in Java or C#/VB.NET) and Use an algorithm like JaroWinkler to estimate the string distance for the combined address, to create a many x many comparison. and write into a separate table
address1 | address n | similarity
You can use Simmetrics to get the similarity thus:
You could also trigram it so that an address such as "1 Jones Street, Sometown, SomeCountry" becomes "1 Jones Street", "Jones Street Sometown", and so on....
and compare the trigrams. (or even 4-grams) for higher accuracy.
Finally you can order by similarity to get a cluster of most similar addresses and decide an approprite threshold. Not sure why you are stuck
我会尝试执行以下操作:
,保持数据“正常”的解决方案是为数据库中的每个特征设置显式字段 否则,您最终将每隔几个月进行一次此练习。
I would try to do the following:
Of course, the solution to keep your data 'in shape' is to have explicit fields for each of your characteristics in your database. Otherwise, you will end up doing this exercise every few months.
我在这里看到的主要问题是准确定义平等。
即使有人写乔恩。还有另一个琼斯。 - 你永远无法说它们是否相同。 (Jon-Jonethan、Joneson、Jonedoe 等等;)
我在一家公司工作,我们必须准确处理这个问题 - 恐怕我必须告诉你,这种检查导航系统地址列表的操作是“手动”完成的大多数时候。缩写有时取决于上下文,并且还有其他原因使这变得困难。 Ofc 替换字符串等是用 python 完成的 - 但告诉你这样一个缩写的含义。在少数情况下只能通过脚本完成。 (“St.”->可以是“Saint”和“Street”。如何决定?不可能……这是人类的工作。)。
另一个大问题是正如你所说的“街上有一个‘DJones’还是一个人?或者两者都有?这里是哪一个?这个DJones是和Dr Jones一样还是和Don Jones一样?无法决定!
你可以对此处另一个答案提出的列表进行一些工作 - 但它会给你足够的“误报”左右。
The main problem I see here is to exactly define equality.
Even if someone writes Jon. and another Jone. - you will never be able to say if they are the same. (Jon-Jonethan,Joneson,Jonedoe whatever ;)
I work in a firm where we have to handle exact this problem - I'm afraid I have to tell you this kind of checking the adress lists for navigation systems is done "by hand" most of the time. Abbrevations are sometimes context dependend, and there are other things that make this difficult. Ofc replacing string etc is done with python - but telling you the MEANING of such an abbr. can only done by script in a few cases. ("St." -> Can be "Saint" and "Street". How to decide? impossible...this is human work.).
Another big problem is as you said "Is there a street "DJones" or a person? Or both? Which one is ment here? Is this DJones the same as Dr Jones or the same as Don Jones? Its impossible to decide!
You can do some work with lists as presented by another answer here - but it will give you enough "false positives" or so.
您有一个邮政编码字段!!!
那么,为什么不购买您所在国家/地区的邮政编码表
并用它来清理您的街道/城镇/地区/省信息?
You have a postcode field!!!
So, why don't you just buy a postcode table for your country
and use that to clean up your street/town/region/province information?
我在上个世纪做了一个这样的项目。基本上,这是合并后两个客户文件的合并,并且涉及来自三个不同来源的姓名和地址。
首先,正如许多发帖者所建议的那样,将所有常见单词、缩写和拼写错误转换为常见形式“Apt”。 “公寓”等改为“Apt”。
然后查看名字并识别名字的第一个字母,加上第一个姓氏。 (考虑“医学博士。亨利·德·巴斯克维尔·斯迈斯爵士”并不那么容易)但不要担心有歧义的地方,只要两者都拿走即可!因此,如果您幸运的话,您将获得 HBASKERVILLE 和 HSMYTHE。现在去掉所有元音,因为这是大多数拼写变化发生的地方,所以现在你有了 HBSKRVLL HSMTH。
您还可以从“H. Baskerville”、“Sir Henry Baskerville Smith”以及不幸的是“Harold Smith”获得这些字符串,但我们在这里讨论的是模糊匹配!
在街道、公寓和邮政编码字段上执行类似的练习。但不要扔掉原始数据!
现在,有趣的是,首先比较每个原始字符串,并为每个完全匹配的字符串给出 50 分。然后检查“规范化”的字符串,并为每个完全匹配的字符串打 20 分。然后遍历所有字符串,并为每个四个或更多字符的共同子字符串给出 5 分。对于每一对比较,您最终都会得到一些分数 > 的结果。 150,您可以认为是某种匹配,有些分数低于 50,您可以认为不匹配,有些介于两者之间,有一定的匹配概率。
您需要进行更多调整来改进这一点,添加各种规则,例如“姓氏‘史密斯’减去 20 分”。您确实必须不断运行和调整,直到您对结果匹配感到满意为止,但是,一旦您查看结果,您就会很好地感觉到哪个分数可以考虑“匹配”,以及哪些是您需要消除的误报的。
I did a project like this in the last centuary. Basicly it was a consolidation of two customer files after a merger, and, involved names and addresses from three different sources.
Firstly as many posters have suggested, convert all the common words and abbreveations and spelling mistakes to a common form "Apt." "Apatment" etc. to "Apt".
Then look through the name and identifiy the first letter of the first name, plus the first surname. (Not that easy consider "Dr. Med. Sir Henry de Baskerville Smythe") but dont worry where there are amiguities just take both! So if you lucky you get HBASKERVILLE and HSMYTHE. Now get rid of all the vowels as thats where most spelling variations occur so now you have HBSKRVLL HSMTH.
You would also get these strings from "H. Baskerville","Sir Henry Baskerville Smith" and unfortunately "Harold Smith" but we are talking fuzzy matching here!
Perform a similar exercise on the street, and apartment and postcode fields. But do not throw away the original data!
You now come to the interesting bit first you compare each of the original strings and give say 50 points for each string that matches exactly. Then go through you "normalised" strings and give say 20 points for each one that matches exactly. Then go through all the strings and give say 5 points for each four character or more substring they have in common. For each pair compared you will end up with some with scores > 150 which you can consider as a certain match, some with scores less than 50 which you can consider not matched and some inbetween which have some probability of matching.
You need some more tweaking to improve this by adding various rules like "subtract 20 points for a surname of 'smith'". You really have to keep running and tweaking until you get happy with the resulting matches, but, once you look at the results you get a pretty good feel which score to consider a "match" and which are the false positives you need to get rid of.
我认为数据量可能会影响哪种方法最适合您。
当我从不同艺术家的合辑中索引音乐时,我遇到了类似的问题。有时艺术家排在第一位,有时歌曲名称排在第一位,并且具有各种分隔符样式。
我所做的是计算具有相同值的其他条目出现的次数,以做出有根据的猜测,无论是歌曲名称还是艺术家。
也许您可以使用 soundex 或类似的算法来查找相似的内容。
编辑:(也许我应该澄清一下,我认为艺术家名字比歌曲名称更有可能更频繁地出现。)
I think the amount of data could affect what approach works best for you.
I had a similar problem when indexing music from compilation albums with various artists. Sometimes the artist came first, sometimes the song name, with various separator styles.
What I did was to count the number of occurrences on other entries with the same value to make an educated guess wether it was the song name or an artist.
Perhaps you can use soundex or similar algorithm to find stuff that are similar.
EDIT: (maybe I should clarify that I assumed that artist names were more likely to be more frequently reoccurring than song names.)
您在评论中提到的一件重要的事情是您将以交互方式执行此操作。
这允许解析用户输入,同时验证对任何缩写的猜测并纠正许多错误(例如电话号码输入在某些联系人管理系统中的工作方式 - 系统尽最大努力解析和纠正国家/地区代码、区号和号码,但最终用户会得到猜测并有机会纠正输入)
如果你想做得很好,那么保留邮政编码、城镇、街道、缩写及其变体的数据库/字典可以改进数据验证和预处理。
所以,至少你会有完全合格的地址。如果您可以对所有输入执行此操作,您将对所有数据进行分类,然后可以在某些字段上严格匹配,而在其他字段上不那么严格,并根据您分配的权重计算匹配分数。
在对输入进行一致预处理后,n 元模型应该能够找到相似的地址。
One important thing that you mention in the comments is that you are going to do this interactively.
This allows to parse user input and also at the same time validate guesses on any abbreviations and to correct a lot of mistakes (the way for example phone number entry works some contact management systems - the system does the best effort to parse and correct the country code, area code and the number, but ultimately the user is presented with the guess and has the chance to correct the input)
If you want to do it really good then keeping database/dictionaries of postcodes, towns, streets, abbreviations and their variations can improve data validation and pre-processing.
So, at least you would have fully qualified address. If you can do this for all the input you will have all the data categorized and matches can then be strict on certain field and less strict on others, with matching score calculated according weights you assign.
After you have consistently pre-processed the input then n-grams should be able to find similar addresses.
您是否为此查看过 SQL Server Integration Services?模糊查找组件允许您查找“近似匹配”:http://msdn .microsoft.com/en-us/library/ms137786.aspx
对于新输入,您可以从 .Net 代码调用该包,将要检查的值行作为一组参数传递,您可能需要保留令牌索引,使其足够快以进行用户交互。
这里有一个地址匹配的示例: http://msdn.microsoft.com/ en-us/magazine/cc163731.aspx
Have you looked at SQL Server Integration Services for this? The Fuzzy Lookup component allows you to find 'Near matches': http://msdn.microsoft.com/en-us/library/ms137786.aspx
For new input, you could call the package from .Net code, passing the value row to be checked as a set of parameters, you'd probably need to persist the token index for this to be fast enough for user interaction though.
There's an example of address matching here: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx
我假设响应时间并不重要,问题是在数据库中查找现有地址,而不是合并重复项。我还假设数据库包含大量地址(例如 300 万个),而不是可以通过手动或通过 亚马逊的 Mechanical Turk。
预计算 - 识别信息含量高的地址片段。
当提供输入地址时,
这个想法是在地址中找到足够的具有高信息内容的片段,可以搜索这些片段以给出合理数量的替代方案,而不是找到最佳匹配。为了更好地容忍拼写错误,可以使用 trigrams、tetra-grams 或 soundex 代码来代替单词。
显然,如果您有实际州/城镇/街道的列表,那么数据库和搜索地址中都可能会进行一些数据清理。 (我很惊讶亚美尼亚邮政服务没有提供这样的列表,但我知道有些邮政服务对这些信息收取过高的费用。)
实际上,我所看到的大多数正在使用的系统都会在可能的情况下尝试通过电话号码查找人们的帐户:显然,这是否是一个实用的解决方案取决于数据及其性质。准确性。
(还要考虑横向思维方法:您能否找到一家可以为您清理数据库的邮购邮件列表经纪公司?他们甚至可能愿意向您支付使用地址的费用。)
I'm assuming that response time is not critical and that the problem is finding an existing address in a database, not merging duplicates. I'm also assuming the database contains a large number of addresses (say 3 million), rather than a number that could be cleaned up economically by hand or by Amazon's Mechanical Turk.
Pre-computation - Identify address fragments with high information content.
When presented with an input address,
The idea is to find enough fragments with high information content in the address which can be searched for to give a reasonable number of alternatives, rather than to find the most optimal match. For more tolerance to misspelling, trigrams, tetra-grams or soundex codes could be used instead of words.
Obviously if you have lists of actual states / towns / streets then some data clean-up could take place both in the database and in the search address. (I'm very surprised the Armenian postal service does not make such a list available, but I know that some postal services charge excessive amounts for this information. )
As a practical matter, most systems I see in use try to look up people's accounts by their phone number if possible: obviously whether that is a practical solution depends upon the nature of the data and its accuracy.
(Also consider the lateral-thinking approach: could you find a mail-order mail-list broker company which will clean up your database for you? They might even be willing to pay you for use of the addresses.)
我发现了一篇很棒的文章。
添加一些dll作为sql用户定义函数,我们可以使用SimMetrics库来使用字符串比较算法。
检查一下
http://anastasiosyal.com/archive/2009/01/11/ 18.aspx
I've found a great article.
Adding some dlls as sql user-defined functions we can use string comparison algorithms using SimMetrics library.
Check it
http://anastasiosyal.com/archive/2009/01/11/18.aspx
这种变化的可能性是无数的,即使存在这样的算法,它也永远不可能是万无一失的。毕竟你没有名词拼写检查器。
您可以做的是提供先前输入的字段值的下拉列表,以便他们可以选择一个(如果特定名称已存在)。
最好为每个值(如公寓等)设置单独的字段。
the possibilities of such variations are countless and even if such an algorithm exists, it can never be fool-proof. u can't have a spell checker for nouns after all.
what you can do is provide a drop-down list of previously entered field values, so that they can select one, if a particular name already exists.
its better to have separate fields for each value like apartments and so on.
您可以将所有地址扔到 Google 地图等网络服务中(不过我不知道这个是否合适),看看它们是否提供相同的 GPS 坐标。
You could throw all addresses at a web service like Google Maps (I don't know whether this one is suitable, though) and see whether they come up with identical GPS coordinates.
一种方法是将 Levenshtein 距离 算法应用于地址字段。这将允许您比较字符串的相似性。
编辑
在查看了您正在处理的地址差异类型之后,这可能根本没有帮助。
One method could be to apply the Levenshtein distance algorithm to the address fields. This will allow you to compare the strings for similarity.
Edit
After looking at the kinds of address differences you are dealing with, this may not be helpful after all.
另一个想法是利用学习。例如,您可以了解每个缩写词及其在句子中的位置,该缩写词的含义。
例如,您可以使用决策树并让用户训练系统。也许每种用途的几个例子就足够了。您不会将像 D.Jones 这样的单字母缩写归类为 David Jones 或 Dr. Jones。但经过第一级翻译后,您可以查找城镇的街道索引,看看是否可以将 D. 扩展为街道名称。
同样,在存储每个地址之前,您将通过决策树运行每个地址。
感觉应该有一些商业产品可以做到这一点。
Another idea is to use learning. For example you could learn, for each abbreviation and its place in the sentence, what the abbreviation means.
You could, for example, use decision trees and have a user train the system. Probably few examples of each use would be enough. You wouldn't classify single-letter abbreviations like D.Jones that could be David Jones, or Dr. Jones as likely. But after a first level of translation you could look up a street index of the town and see if you can expand the D. into a street name.
Again, you would run each address through the decision tree before storing it.
It feels like there should be some commercial products doing this out there.
一种可能性是在数据库中有一个字典表,将所有变体映射到单词的“正确”版本:
然后在比较之前通过字典运行每个单词。
编辑:这个单独太天真而不实用(请参阅评论)。
A possibility is to have a dictionary table in the database that maps all the variants to the 'proper' version of the word:
Then you run each word through the dictionary before you compare.
Edit: this alone is too naive to be practical (see comment).