Data cleanup: is there a library of common permutations we can use? Or is there a better way?
We are working on clean-up and analysis of a lot of human-entered customer data. We need to decide programmatically whether two addresses (for example) are the same, even though the data was entered with slight variations.
Right now we run each address through fairly simplistic string replacement (replacing avenue with ave, for example), concatenate the fields and compare the results. We are doing something similar with names.
At the very least, it seems like our list of search-replace values should already exist somewhere.
Or perhaps you can suggest a totally different and superior way to detect matches?
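For reference, the current approach described above can be sketched roughly as follows. This is only an illustration: the abbreviation table and the function names are hypothetical, and a real replacement list would be much longer.

```python
import re

# Hypothetical search-replace table; the real list would be far longer.
ABBREVIATIONS = {
    "avenue": "ave",
    "street": "st",
    "road": "rd",
    "boulevard": "blvd",
    "suite": "ste",
}

def normalize_address(*fields):
    """Lowercase, drop punctuation, abbreviate known words, join the fields."""
    text = " ".join(fields).lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # punctuation becomes whitespace
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

def same_address(a_fields, b_fields):
    """Exact comparison of the normalized, concatenated fields."""
    return normalize_address(*a_fields) == normalize_address(*b_fields)
```

With this, `same_address(("123 Main Ave.", "Springfield"), ("123 MAIN AVENUE", "springfield"))` compares equal, but a typo such as "Mian" would still defeat it, which is why we are asking about better methods.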
Comments (4)
For the addresses, you should run them through Google's Maps API and get a geocode for each one. Then if the geocodes are the same, the place is the same. I believe they allow 10k hits/day/IP for free.
It's unlikely that you'd come up with anything better on your own.
http://code.google.com/apis/maps/
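Once each address has been geocoded, the comparison itself might look something like the sketch below. It assumes you already have a (latitude, longitude) pair per address from whatever geocoder you use; the function names and the 25 m tolerance are my own assumptions, not part of any Google API. A small tolerance is used instead of strict equality because two spellings of one address may not come back with bit-identical coordinates.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def same_place(a, b, tolerance_m=25.0):
    """Treat two (lat, lon) geocodes as one place if they are within tolerance_m.

    The default tolerance is an arbitrary assumption; tune it for your data.
    """
    return haversine_m(a[0], a[1], b[0], b[1]) <= tolerance_m
```

For example, `same_place((40.0, -75.0), (40.0001, -75.0))` is true (the points are about 11 m apart), while points 0.01 degrees of latitude apart (over 1 km) are not matched.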
Soundex and its variants might be a good start, as are the other approaches suggested on the Soundex Wikipedia page.
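To give an idea of what that looks like for names, here is a minimal sketch of classic American Soundex. The function name is mine, and it only handles plain ASCII letters, so treat it as a starting point rather than a finished implementation.

```python
# Digit codes for consonants; vowels, y, h and w have no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name):
    """Return the 4-character Soundex code, or "" if name has no ASCII letters."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters that share a code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (result + "000")[:4]
```

Names that sound alike get the same code: `soundex("Robert")` and `soundex("Rupert")` both give `R163`, and `soundex("Smith")` and `soundex("Smyth")` both give `S530`.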
Essentially you're trying to find how similar two strings are, and there are a lot of different ways to measure that. The Dice coefficient could work fairly well for what you're doing, although it is a fairly costly operation.
http://en.wikipedia.org/wiki/Dice_coefficient
If you want a more comprehensive list of string similarity measures try here:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
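As a rough illustration, a Dice coefficient over character bigrams could be computed like this. The function names are my own, and the normalization (lowercasing, collapsing whitespace) is an assumption about what makes sense for addresses. The score is 2 * (shared bigrams) / (total bigrams in both strings), so it ranges from 0.0 to 1.0.

```python
from collections import Counter

def bigrams(text):
    """Multiset of adjacent character pairs, after lowercasing and collapsing spaces."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def dice_coefficient(a, b):
    """Dice similarity of two strings based on their character bigrams."""
    x, y = bigrams(a), bigrams(b)
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are shorter than two characters: fall back to equality.
        same = " ".join(a.lower().split()) == " ".join(b.lower().split())
        return 1.0 if same else 0.0
    overlap = sum((x & y).values())  # multiset intersection
    return 2.0 * overlap / total
```

With this, `dice_coefficient("night", "nacht")` is 0.25 and `dice_coefficient("123 Main Ave", "123 Main Avenue")` is 0.88. You would still need to pick a threshold above which two records count as a match, and comparing every pair of records is where the cost comes from.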
At work I help write software that verifies addresses (for SmartyStreets).
Address validation is a really tricky operation -- in fact the USPS has designated certain companies which are certified to provide this service. I would not recommend (even if I were in your shoes) that you attempt this on your own. As mentioned, Google does some address parsing, but it only approximates the address. Google, Yahoo, and similar services will not verify the accuracy of the address data.
So you'll need a CASS-Certified approach to this problem. I would suggest something like the LiveAddress API (for point-of-entry validation) or Certified Scrubbing (for existing lists or databases of addresses). Both are CASS-Certified by the USPS and will do what you require.