使用 t-sql 或 SSIS 解析地址字段的最佳方法是什么?
我有一个数据集,每晚都会导入到 SQL 表中。其中一个字段是“Address_3”,包含城市、州、邮政编码和国家/地区字段。然而,这些数据并未标准化。如何最好地将当前进入 1 个字段的数据解析为各个字段。以下是我可能收到的数据的一些示例:
'INDIANAPOLIS, IN 46268 US'
'INDIANAPOLIS, IN 46268-1234 US'
'INDIANAPOLIS, IN 46268-1234'
'INDIANAPOLIS, IN 46268'
提前致谢! 大卫
I have a data set that I import into a SQL table every night. One field is 'Address_3' and contains the City, State, Zip and Country fields. However, this data isn't standardized. How can I best parse the data that is currently going into 1 field into individual fields. Here are some examples of the data I might receive:
'INDIANAPOLIS, IN 46268 US'
'INDIANAPOLIS, IN 46268-1234 US'
'INDIANAPOLIS, IN 46268-1234'
'INDIANAPOLIS, IN 46268'
Thanks in advance!
David
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
在 SSIS 中,您可以有 4 个派生列(城市、州、邮政编码、国家/地区)。
您可以看到上面的模式并相应地继续。这可能会变得有点复杂。您可以使用脚本组件更好地提取文本行。
In SSIS you can have 4 derived columns (city,state,zip,country).
You can see the pattern above and continue accordingly. This might get a bit complicated. You can use a Script Component to better pull out the lines of text.
像这样的东西应该有帮助:
something like this should help:
我已经做了类似的事情(不是在 T-SQL 中),我发现从字符串末尾开始向后工作效果最好。
抓住第二个最右边的元素直到下一个空格或逗号
抓住最后一个逗号之前的所有内容并将其称为城市。
您需要根据输入数据的情况进行一些调整,但基本思想是从右侧开始,获取可以轻松分类的元素,并将其他所有内容称为城市。
您可以通过使用 REVERSE 函数来实现类似的搜索更容易(在这种情况下,您将从左到右解析字符串,而不是像我上面所说的从右到左), PATINDEX 或 CHARINDEX< /a> 函数查找空格和逗号,以及 SUBSTRING 函数根据 PATINDEX 和 CHARINDEX 找到的位置将地址分开。您可以使用 ASCII 函数来确定字符是数字还是不是。
您也使用 SSIS 标签标记了您的问题 - 在 SSIS 中的某些 VB 脚本中实现解析可能比尝试使用 T-SQL 更容易。
I've done something similar (not in T-SQL) and I find it works best to start at the end of the string and work backwards.
Grab the second rightmost element up to the next space or comma
Grab everything else preceding the last comma and call it the city.
You'll need to make some adjustments based on what your input data looks like but the basic idea is to start from the right, grab the elements you can easily classify and call everything else the city.
You can implement something like this by using the REVERSE function to make searching easier (in which case you'll be parsing the string from left to right instead of right to left like I said above), the PATINDEX or CHARINDEX functions to find spaces and commas, and the SUBSTRING function to pull the address apart based on the positions found by PATINDEX and CHARINDEX. You could use the ASCII function to determine if a character is numeric or not.
You tagged your question with the SSIS tag as well - it might be easier to implement the parsing in some VB script in SSIS rather than try to do it with T-SQL.
到目前为止,最好的方法是不要重新发明轮子并获得地址解析和标准化引擎。理想情况下,您应该使用经 CASS 认证的发动机,即经邮政局批准的发动机。然而,现在网络上有免费的地址解析器,其中任何一个都会比尝试自己解析地址更准确且更不令人沮丧。
也就是说,我会说地址解析器和邮局是自下而上工作的(因此,国家/地区,然后是邮政编码,然后是城市,然后是州,然后是地址行 2 等)。
By far the best way is to not reinvent the wheel and get an address parsing and standardization engine. Ideally, you would use a CASS certified engine which is what is approved by the Postal Service. However, there are free address parsers on the net these days and any of those would be more accurate and less frustrating than trying to parse the address yourself.
That said, I will say that address parsers and the Post Office work from bottom up (So, country, then zip code, then city, then state then address line 2 etc.).