使用 t-sql 或 SSIS 解析地址字段的最佳方法是什么?

发布于 2024-08-31 03:01:36 字数 287 浏览 12 评论 0原文

我有一个数据集,每晚都会导入到 SQL 表中。其中一个字段是“Address_3”,包含城市、州、邮政编码和国家/地区字段。然而,这些数据并未标准化。如何最好地将当前进入 1 个字段的数据解析为各个字段。以下是我可能收到的数据的一些示例:

'INDIANAPOLIS, IN 46268  US'
'INDIANAPOLIS, IN 46268-1234  US'
'INDIANAPOLIS, IN 46268-1234'
'INDIANAPOLIS, IN 46268'

提前致谢! 大卫

I have a data set that I import into a SQL table every night. One field is 'Address_3' and contains the City, State, Zip and Country fields. However, this data isn't standardized. How can I best parse the data that is currently going into 1 field into individual fields. Here are some examples of the data I might receive:

'INDIANAPOLIS, IN 46268  US'
'INDIANAPOLIS, IN 46268-1234  US'
'INDIANAPOLIS, IN 46268-1234'
'INDIANAPOLIS, IN 46268'

Thanks in advance!
David

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

我是男神闪亮亮 2024-09-07 03:01:37

在 SSIS 中,您可以有 4 个派生列(城市、州、邮政编码、国家/地区)。

substring(column,1,FINDSTRING(",",column,1)-1)  --city
substring(column,FINDSTRING(" ",column,1)+1,FINDSTRING("",column,2)-1) --state
substring(column,FINDSTRING(" ",column,2)+1,FINDSTRING(" ",column,3)-1) -- zip

您可以看到上面的模式并相应地继续。这可能会变得有点复杂。您可以使用脚本组件更好地提取文本行。

In SSIS you can have 4 derived columns (city,state,zip,country).

substring(column,1,FINDSTRING(",",column,1)-1)  --city
substring(column,FINDSTRING(" ",column,1)+1,FINDSTRING("",column,2)-1) --state
substring(column,FINDSTRING(" ",column,2)+1,FINDSTRING(" ",column,3)-1) -- zip

You can see the pattern above and continue accordingly. This might get a bit complicated. You can use a Script Component to better pull out the lines of text.

岁月流歌 2024-09-07 03:01:37

像这样的东西应该有帮助:

select substring(CityStateZip, 1, 
case when charindex(',',reverse(CityStateZip)) = 0 then len(CityStateZip) 
else len(CityStateZip) - charindex(',',reverse(CityStateZip)) end) as City,
LEFT(LTRIM(
SUBSTRING(CityStateZip, case when charindex(',',reverse(CityStateZip)) = 0 then len(CityStateZip) else 
len(CityStateZip) - charindex(',',reverse(CityStateZip))+2 end, LEN(CityStateZip)))
,2) as State,

SUBSTRING(CityStateZip, case when charindex(' ',reverse(CityStateZip)) = 0 then len(CityStateZip) else 
len(CityStateZip) - charindex(' ',reverse(CityStateZip))+2 end, LEN(CityStateZip)) as Zip

from   YourAddressTable

something like this should help:

select substring(CityStateZip, 1, 
case when charindex(',',reverse(CityStateZip)) = 0 then len(CityStateZip) 
else len(CityStateZip) - charindex(',',reverse(CityStateZip)) end) as City,
LEFT(LTRIM(
SUBSTRING(CityStateZip, case when charindex(',',reverse(CityStateZip)) = 0 then len(CityStateZip) else 
len(CityStateZip) - charindex(',',reverse(CityStateZip))+2 end, LEN(CityStateZip)))
,2) as State,

SUBSTRING(CityStateZip, case when charindex(' ',reverse(CityStateZip)) = 0 then len(CityStateZip) else 
len(CityStateZip) - charindex(' ',reverse(CityStateZip))+2 end, LEN(CityStateZip)) as Zip

from   YourAddressTable
甜柠檬 2024-09-07 03:01:36

我已经做了类似的事情(不是在 T-SQL 中),我发现从字符串末尾开始向后工作效果最好。

  • 抓取最右边的元素,直到第一个空格或逗号。
  • 它是已知的国家/地区代码吗?这是一个国家/地区
  • 如果不是,是否都是数字(包括连字符)?这是一个邮政编码。
  • 否则丢弃它

抓住第二个最右边的元素直到下一个空格或逗号

  • 它是两个字母字符字段吗?这是州

抓住最后一个逗号之前的所有内容并将其称为城市。

您需要根据输入数据的情况进行一些调整,但基本思想是从右侧开始,获取可以轻松分类的元素,并将其他所有内容称为城市。

您可以通过使用 REVERSE 函数来实现类似的搜索更容易(在这种情况下,您将从左到右解析字符串,而不是像我上面所说的从右到左), PATINDEXCHARINDEX< /a> 函数查找空格和逗号,以及 SUBSTRING 函数根据 PATINDEX 和 CHARINDEX 找到的位置将地址分开。您可以使用 ASCII 函数来确定字符是数字还是不是。

您也使用 SSIS 标签标记了您的问题 - 在 SSIS 中的某些 VB 脚本中实现解析可能比尝试使用 T-SQL 更容易。

I've done something similar (not in T-SQL) and I find it works best to start at the end of the string and work backwards.

  • Grab the rightmost element up to the first space or comma.
  • Is it a known country code? It's a country
  • If not, is it all numeric (including a hyphen)? It's a zip code.
  • Else discard it

Grab the second rightmost element up to the next space or comma

  • Is it a two alpha-character field? It's the state

Grab everything else preceding the last comma and call it the city.

You'll need to make some adjustments based on what your input data looks like but the basic idea is to start from the right, grab the elements you can easily classify and call everything else the city.

You can implement something like this by using the REVERSE function to make searching easier (in which case you'll be parsing the string from left to right instead of right to left like I said above), the PATINDEX or CHARINDEX functions to find spaces and commas, and the SUBSTRING function to pull the address apart based on the positions found by PATINDEX and CHARINDEX. You could use the ASCII function to determine if a character is numeric or not.

You tagged your question with the SSIS tag as well - it might be easier to implement the parsing in some VB script in SSIS rather than try to do it with T-SQL.

帅的被狗咬 2024-09-07 03:01:36

到目前为止,最好的方法是不要重新发明轮子并获得地址解析和标准化引擎。理想情况下,您应该使用经 CASS 认证的发动机,即经邮政局批准的发动机。然而,现在网络上有免费的地址解析器,其中任何一个都会比尝试自己解析地址更准确且更不令人沮丧。

也就是说,我会说地址解析器和邮局是自下而上工作的(因此,国家/地区,然后是邮政编码,然后是城市,然后是州,然后是地址行 2 等)。

By far the best way is to not reinvent the wheel and get an address parsing and standardization engine. Ideally, you would use a CASS certified engine which is what is approved by the Postal Service. However, there are free address parsers on the net these days and any of those would be more accurate and less frustrating than trying to parse the address yourself.

That said, I will say that address parsers and the Post Office work from bottom up (So, country, then zip code, then city, then state then address line 2 etc.).

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文