Vim: Parsing address fields from around the globe
Intro
This post is long, but I consider it thorough. I hope it is helpful to others who work with addresses, and that it teaches some complex Vim regexes along the way. Thank you for your time.
Worldwide addresses:
Users from the US, Canada and a few other countries are offered 5 fields on a form, which are then displayed in a comma-delimited format that I need to dissect further. Ideally, the comma-separated content looks like:
Some Really Nice Place, 111 Street, Beautiful Town, StateOrProvince, zip
where zip can be either a series of just numbers (US) or numbers and letters (Canada).
Invariably, people throw an extra comma into their text box field input and that adds some complexity to the parsing of this data. For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, StateOrProvince, zip
Further complicating this parse is that the data from non-US and non-Canadian countries contains an extra comma-delimited field that was somehow provided to them - adding a place for them to enter their country. (No, there is no "US" or "Canada" field for their entries. So, it's "in addition" to the original 5 comma-delimited fields.) Such as:
Foreign Name of Building, A street name, A City, ,zip, Country
The ",," is usually empty, as non-US countries are not segmented into states. And, yes, the same "additional commas" problem described above happens here too.
Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country
Parsing Strategy:
A country name will never include a digit, whereas a US or Canadian zip will always have at least some digits. If you work backwards from the last field using this assumption, you should be able to place the country, zip, State (if not empty ",,"), City and Street into their respective positions - these are the most important fields to get right. Anything beyond those sections can be lumped together in the first one or two lines as descriptions of the address (e.g. building, name, suite, cross streets). For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, Lovely State, Digits&Letters
- Last section has a digit (therefore a US or Canadian address)
- There are a total of 6 sections, so that's one more than the original 5
- Knowing that sections 5-2 are zip, state, town, address...
- 6 minus 5 (original) = add an extra Address (Address2) field and leave the first section as the header, resulting in:
Header: Some Really Nice Place, Address1: 111 Street, Address2: Suite 101, Town: Beautiful Town, State/Province: Lovely State, Zip: Digits&Letters
Although there might be a discrepancy on where "111 Street" or "Suite 101" goes (Address1 or Address2), this at least gets the zip, state, city and address(es) lumped together and leaves the first section as the "Header" to the email address for data entry purposes.
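The first decision in this strategy (does the last field hold a zip or a country?) can be sketched in Vim script as below. This is only an illustration under the assumption stated above that a country name never contains a digit; LastFieldHasDigit() is a hypothetical name, not something from an existing plugin.

```vim
" Hypothetical helper: returns 1 when the last comma-delimited field
" contains a digit (treated as a US/Canadian zip), 0 otherwise
" (treated as a country name).
function! LastFieldHasDigit(addr) abort
  " The third argument (keepempty = 1) keeps blank fields such as ', ,'.
  let l:parts = split(a:addr, ',', 1)
  return get(l:parts, -1, '') =~# '\d'
endfunction

" echoes 1
echo LastFieldHasDigit('Some Really Nice Place, 111 Street, suite 101, Beautiful Town, Lovely State, K7M 8A6')
" echoes 0
echo LastFieldHasDigit('Foreign Name of Building, A street name, A City, ,zip, Country')
```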
Under this approach, foreign addresses get parsed like:
Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country
- Last section has no digit, so it must be a Country
- That means, moving right to left, the second section from the end is the zip
- So now (foreign) you have an "original 6 sections" to subtract from the total of 7 in the example
- 7th section = country, 6th = zip, 5th = state (mostly blank on foreign address), 4th = City, 3rd = address1, 2nd = address2, 1st = header
- We knew to use two address fields because the example had 7 sections and foreign addresses have a base of 6 sections. Any sections above the base count are added to the Address2 field. If there are 3 sections above the base count, they are all appended to one another inside the Address2 field.
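Put together, the right-to-left strategy could look roughly like the following Vim script sketch. ParseAddress() and its dictionary keys are hypothetical names chosen for illustration. The sketch also assumes the ordering from the US example (first section is the header, second is Address1, everything left over is joined into Address2), which is only one way to resolve the Address1/Address2 discrepancy mentioned earlier.

```vim
" Sketch only: ParseAddress() and the key names are illustrative assumptions.
function! ParseAddress(line) abort
  " Trim the line, then split on commas (with surrounding blanks),
  " keeping empty fields so a blank state still occupies a slot.
  let l:text = substitute(a:line, '^\s\+\|\s\+$', '', 'g')
  let l:parts = split(l:text, '\s*,\s*', 1)
  let l:r = {'header': '', 'address1': '', 'address2': '', 'city': '', 'state': '', 'zip': '', 'country': ''}

  " No digit in the last field: treat it as a country and peel it off.
  if len(l:parts) > 1 && l:parts[-1] !~# '\d'
    let l:r.country = remove(l:parts, -1)
  endif

  " Fewer than the 5 base fields: keep everything in the header.
  if len(l:parts) < 5
    let l:r.header = join(l:parts, ', ')
    return l:r
  endif

  " Work backwards: zip, state, city.
  let l:r.zip = remove(l:parts, -1)
  let l:r.state = remove(l:parts, -1)
  let l:r.city = remove(l:parts, -1)

  " Work forwards: header, address1; any remaining sections go to address2.
  let l:r.header = remove(l:parts, 0)
  let l:r.address1 = remove(l:parts, 0)
  let l:r.address2 = join(l:parts, ', ')
  return l:r
endfunction

echo ParseAddress('Graylang Seray Center, 6600 Haig Rd, Singapore, , 437848, Singapore')
```

Returning a Dictionary rather than a formatted string leaves the output layout (Header:, Address1:, and so on) to the caller.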
Coding
In this approach using Vim, how would I initially read the number of comma-delimited sections (after I've captured the entire address in a register)? How do I do submatch(es) on a series of comma-delimited sections when I do not know how many sections there are?
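One possible way to approach both questions is to avoid capture groups (a Vim pattern only offers \1 to \9) and instead split the register contents into a List, which can be counted with len() and indexed from the end with negative indices. The helper names SectionCount() and Sections() below are made up for this sketch, and register a is just an example.

```vim
" Hypothetical helper: number of comma-delimited sections in a string.
function! SectionCount(text) abort
  " keepempty = 1 so that blank sections are counted too.
  return len(split(a:text, ',', 1))
endfunction

" Hypothetical helper: the sections themselves, as a List of trimmed strings.
function! Sections(text) abort
  " A linewise yank ends with a newline; drop it, then trim blanks.
  let l:clean = substitute(a:text, '\n', '', 'g')
  let l:clean = substitute(l:clean, '^\s\+\|\s\+$', '', 'g')
  return split(l:clean, '\s*,\s*', 1)
endfunction

" Example use, after yanking an address line into register a with "ayy :
"   :echo SectionCount(getreg('a'))
"   :echo Sections(getreg('a'))[-1]     (last section)
"   :echo Sections(getreg('a'))[-2]     (second section from the end)
```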
Example Addresses
Here are some practice addresses (US and foreign) if you are so inclined to help:
City Gas & Electric - Bldg 4, 222 Middle Park Ct, CP4120F, Dallas, Texas, 44984
MHG Engineering, Inc. Suite 200, 9899 Balboa Ave, San Diego, California, 92123-1502
SolarWind Turbines, 2nd Floor Conference Room, 2300 Ruffin Road, Seattle, Washington, 84444
123 Aeronautics, 2239 Industry Parkway, Salt Lake City, Utah, 55344
Ongwanda Gov't Resources, 6000 Portsmouth Avenue, Ottawa, Ontario, K7M 8A6
Graylang Seray Center, 6600 Haig Rd, Singapore, , 437848, Singapore
Lot 459, Block 14, Jalan Sultan Tengah, Petra Jaya, Kuching, , 93050, Malaysia
Virtual Steel, 1 Umgazi Rd Aspec Park, Pretoria, , 0075, South Africa
Idiom Towers South, Fifth Floor, Jasmen Conference Room, 1500 Freedom Street, Pretoria, , 0002, South Africa
Comments (2)
The following code is a draft-quality Vim script (hopefully) implementing the
address parsing routine described in the question.
The command below can be used to test the above parsing routines.
(One can provide a range to the :global command to run it through a smaller number of test address lines.)
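As an illustration of how :global can drive such a test, with and without a range, here is a minimal sketch. AddrFieldSummary() is a hypothetical stand-in written for this example; it is not the parsing routine referred to above.

```vim
" Hypothetical stand-in for a parsing routine: reports how many
" comma-delimited fields a line has and what the last one is.
function! AddrFieldSummary(line) abort
  let l:text = substitute(a:line, '^\s\+\|\s\+$', '', 'g')
  let l:parts = split(l:text, '\s*,\s*', 1)
  return printf('%d fields, last: %s', len(l:parts), l:parts[-1])
endfunction

" Run it on every buffer line that contains a comma:
"   :global/,/echo AddrFieldSummary(getline('.'))
" Restrict the test to lines 1 through 3 by giving :global a range:
"   :1,3global/,/echo AddrFieldSummary(getline('.'))
```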
Maybe you should review some of the other questions about addresses around the world. The USA and Canada are extraordinarily systematic with their systems; most other countries are a lot less rigorous about the approved formats. Anything you devise for the USA and Canada will run into issues almost immediately you deal with other addresses.
There are probably other related questions: see the tag street-address for some of them.