SQL address data is messy: how can I clean it up in a query?
I have address data stored in a SQL Server 2000 database, and I need to pull out all the addresses for a given customer code. The problem is that there are a lot of misspelled addresses, some with missing parts, and so on, so I need to clean this up somehow. I need to weed out the bad spellings, missing parts, etc., and come up with the "average" record. For example, if New York is spelled properly in 4 out of 5 records, that should be the value returned.
I can't modify the data, validate it on input, or anything like that. I can only modify a copy of the data, or manipulate it through a query.
I got a partial answer in "Addresses stored in SQL server have many small variations (errors)", but I need to allow for multiple valid addresses per code.
Sample Data
Code   Name                    Address1                    Address2     City         State       Zip    TimesUsed
10003  AMERICAN NUTRITON INC   2183 BALL STREET                         OLDEN        Utah        87401  177
10003  AMEICAN NUTRITION INC   2183 BALL STREET            PO BOX 1504  OLDEN        Utah        87402  76
10003  AMERICAN NUTRITION INC  2183 BALL STREET                         OLDEN        Utah        87402  24
10003  AMERICAN NUTRITION INC  2183 BALL STREET            PO BOX 1504  OLDEN        Utah        87402  17
10003  Samantha Brooks         506 S. Main Street                       Ellensburg   Washington  98296  1
10003  BEMIS COMPANY           1401 W. FOURTH PLAIN BLVD.               VANCOUVER    Washington  98660  1
10003  CEI                     597 VANDYRE BOULEVARD                    WRIGHTSTOWN  Wisconsin   54180  1
10003  Pacific Pet             28th Avenue                              OLDEN        Utah        84401  1
10003  PETSMART, INC.          16091 NORTH 25TH STREET                  PHOENA       Arizona     85027  1
10003  THE PET FIRM            16418 NORTH 37TH STREET                  PHOENA       Arizona     85503  1
Desired Output
Code   Name                    Address1                    Address2     City         State       Zip
10003  AMERICAN NUTRITION INC  2183 BALL AVENUE                         Olden        Utah        84401
10003  Samantha Brooks         506 S. Main Street                       Ellensburg   Washington  98296
10003  BEMIS COMPANY           1401 W. FOURTH PLAIN BLVD.               VANCOUVER    Washington  98660
10003  CEI                     975 VANDYKE ROAD                         WRIGHTSTOWN  Wisconsin   54180
10003  Pacific Pet             29th Street                              OGDEN        Utah        84401
10003  PETSMART, INC.          16091 NORTH 25TH AVENUE                  PHOENA       Arizona     85027
10003  THE PET FIRM            16418 NORTH 37TH STREET                  PHOENA       Arizona     85503
Comments (6)
The best solution is to use a CASS-certified address standardization program or service that will format and validate the address. Beyond the USPS, which has tools for this, there are many third-party programs and services that provide this functionality. Address parsing is far more complicated than you might imagine, so trying to whip up a few queries to do it will be fraught with peril.
Google's Geocoding is another place to look. Apparently, Google requires you to display the results in order to use their Geocoding service. That leaves using dedicated address parsers like the USPS or a third-party program.
Using group by
soundex(name)
you will get a result like this. You have to test on your data to figure out whether this is helpful in your situation. I cannot test this on SQL Server 2000, so I am not sure if soundex is available.
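To get a feel for what grouping by soundex(name) would do before writing the query, here is a minimal Python sketch that can be run against a copy of the data. It assumes a standard American Soundex (SQL Server's SOUNDEX may differ in edge cases, for example in how it treats spaces and non-letters), and it assumes that the "average" record means the most common value per field within each group. The names soundex, consolidate, FIELDS and sample are made up for this illustration.

```python
from collections import Counter, defaultdict

# Standard American Soundex letter codes (vowels, H, W and Y carry no code).
CODES = {ch: str(digit)
         for digit, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
         for ch in group}

def soundex(text):
    """Four-character Soundex of the letters in text (non-letters are skipped)."""
    letters = [c for c in text.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not break up a run of equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]

FIELDS = ["Name", "Address1", "Address2", "City", "State", "Zip"]

def consolidate(rows):
    """Group rows by (Code, soundex(Name)); keep the most common value per field."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["Code"], soundex(row["Name"]))].append(row)
    merged = []
    for (code, _), members in groups.items():
        record = {"Code": code}
        for field in FIELDS:
            record[field] = Counter(m[field] for m in members).most_common(1)[0][0]
        merged.append(record)
    return merged

# The four AMERICAN NUTRITION rows from the sample data (TimesUsed left out).
COLUMNS = ["Code"] + FIELDS
sample = [dict(zip(COLUMNS, r)) for r in [
    ("10003", "AMERICAN NUTRITON INC", "2183 BALL STREET", "", "OLDEN", "Utah", "87401"),
    ("10003", "AMEICAN NUTRITION INC", "2183 BALL STREET", "PO BOX 1504", "OLDEN", "Utah", "87402"),
    ("10003", "AMERICAN NUTRITION INC", "2183 BALL STREET", "", "OLDEN", "Utah", "87402"),
    ("10003", "AMERICAN NUTRITION INC", "2183 BALL STREET", "PO BOX 1504", "OLDEN", "Utah", "87402"),
]]

for record in consolidate(sample):
    print(record)
```

On these rows the sketch produces two groups rather than one: AMERICAN NUTRITON and AMERICAN NUTRITION both code to A562, but the AMEICAN misspelling codes to A525 and lands in its own group. That is the kind of thing you would need to check on your own data before relying on soundex.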
Based on the same answer you mentioned, try this:
Well, here is a start, haha...
For work, I help write software that does address verification (for SmartyStreets). I'd like to echo Thomas' answer in that the only practical and effective solution would be to use a CASS-Certified vendor. It is highly complicated, but those services will do it for you and do it well.
I'll also add that most free APIs have license restrictions that prevent the use of their service for processing lists of addresses (Google isn't the only one -- even the USPS has restrictions for use of their API).
I would recommend a service like LiveAddress or CASS-Certified Scrubbing for your needs (the latter probably best for an existing table), but I'll let you do your own research so you're more informed. I'll be happy to help you personally with any more address-related questions.
OpenRefine may be your answer.
But you need to export to CSV, clean it up in OpenRefine, and then import it back into SQL Server.
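For a rough idea of the kind of cleanup OpenRefine can do on the exported values, here is a small Python sketch of a "fingerprint" key in the spirit of the key-collision clustering that OpenRefine documents. It is an approximation written for illustration, not OpenRefine's actual implementation, and the function names fingerprint and cluster are made up here.

```python
import string
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Normalize a value so that trivially different spellings share one key."""
    # Fold accented characters to plain ASCII, then trim and lowercase.
    text = unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    text = text.strip().lower()
    # Drop punctuation, then sort the unique whitespace-separated tokens.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Return only the keys under which more than one distinct spelling collides."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}

# City spellings taken from the sample data and the desired output.
print(cluster(["OLDEN", "Olden", "OGDEN"]))
```

A key like this only catches differences in case, punctuation and word order, so OLDEN and Olden collide but OLDEN and OGDEN do not. Real misspellings need a fuzzier method or a CASS-certified service as suggested above.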