Java: Parsing Australian street addresses
Looking for a quick and dirty way to parse Australian street addresses into their parts:
3A/45 Jindabyne Rd, Oakleigh, VIC 3166
should split into:
"3A", 45, "Jindabyne Rd", "Oakleigh", "VIC", 3166
Suburb names can have multiple words, as can street names.
See: Parse A Steet Address into components
Has to be in Java, cannot make http requests (e.g. to web APIs).
EDIT: Assume that format specified is always followed. I have no issue with spitting incorrectly formatted strings back at the user with a message telling them to follow the format (which I've described above).
Honestly, you're setting yourself a rather Sisyphean challenge here, and I'm not sure it's worthwhile. Unless your data comes from a known source with a very well-specified format, you're going to get data that's completely useless. If you're dealing with free text, people screw up their addresses in ways you wouldn't believe.
Do you really want to try (yourself) to parse every possible combination of "Richmond, Victoria, 3121" and "Richmond 3121 VIC" and "Richmond VIC, 3121" etc.? And that's just suburb granularity!
Addresses are even worse. Sure, most people would put "7/21 Smith St" for a unit, or "29-33 Jones St" for a location spanning multiple street numbers, but people aren't consistent. Is "1-5 Brown St" unit 1 at number 5, or a location spanning #1 to #5 on that street? Is "7A" a separate subdivided street address, or Unit A at #7?
Address matching is not a simple problem, and if your data set is end-user-entered free text, I seriously wouldn't bother unless you have a trivial amount of data or don't care about accuracy that much (or, alternatively, have a lot of time for manual cleanups). If not, hand it off to a piece of software that does this work for you.
Australia Post have something called the Postal Address File (PAF), which contains every valid delivery location in Australia. There are a number of software libraries which will do the parsing + matching for you, and either give you a definitive answer (including all the individual address components, as you're after) or provide a list of potential matches for you to choose from if the address is non-existent or ambiguous. One example I'm aware of is QAS Batch (not affiliated with them in any way; I evaluated their software in the past but didn't end up using it), but that's just one example; there's a list of others accessible through the PAF website.
Cannot recommend strongly enough that you don't waste your time on this unless it's at a trivial scale.
If it is, hey, yeah, regex.
Given your reply to my other answer, this should do for the strictly-formatted case you specify:
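The code block that belonged here does not appear in this copy of the answer. What follows is only a sketch, reassembled from the pattern fragments the answer explains piece by piece; the class name AddressParser, the parse method and the returned array layout are hypothetical and not from the original post. The group numbering is chosen so that m.group(2) is the unit, as the answer states.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AddressParser {
    // Optional "unit/" prefix, then street number, street name, suburb, state, postcode.
    // Assembled from the fragments described in the answer; the exact original is unknown.
    private static final Pattern ADDRESS = Pattern.compile(
            "(([^/ ]+)/)?([^ ]+) ([^,]+), ([^,]+), ([^ ]+) (\\d+)");

    /**
     * Returns {unit, number, street, suburb, state, postcode},
     * or null if the input does not follow the expected format.
     * The unit element is null when the address has no "unit/" prefix.
     */
    static String[] parse(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            return null; // caller can ask the user to follow the format
        }
        return new String[] {
            m.group(2), m.group(3), m.group(4), m.group(5), m.group(6), m.group(7)
        };
    }

    public static void main(String[] args) {
        String[] parts = parse("3A/45 Jindabyne Rd, Oakleigh, VIC 3166");
        System.out.println(String.join(" | ", parts));
    }
}
```

Returning null on a failed match fits the question's stated plan of handing badly formatted strings back to the user.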
This works if you remove the '3A/' (in which case m.group(2) will be null), if the street number is '45A' or '45-47', and if we add a space to the road ('Jindabyne East Rd') or to the suburb ('Oakleigh South').
Just to explain that regex further, if you're not familiar with regular expressions:
(([^/ ]+)/)? is the equivalent of just ([^/ ]+/)? -- that is, 'anything not including a forward slash or a space, followed by a slash'. The question mark makes it optional (so the whole clause can be missing), and the extra parentheses in the final version are to create a smaller inner group, without the slash, for later extraction.
([^ ]+) is 'capture anything that's not a space (which is followed by a space)' -- this is the street number.
([^,]+), is 'capture anything that's not a comma (which is followed by comma and space)' -- this is the street name. Anything is valid in the street name as long as it's not a comma.
([^,]+), is the same again, in this case to capture the suburb.
([^ ]+) captures the next non-space string (state abbreviation) and skips the space after it.
(\\d+) rounds off by capturing any number of digits (the postcode).
Hope that's helpful.
Hm, probably quite difficult because the format is not well defined.
A regex would certainly work as a quick&dirty solution. The problem is that it will probably fail (produce incorrect results) in special cases.
Best bet is probably to hack up a small regex, then run that over a realistic dataset (ideally everything you have in production), and check if it gives good results. May be a lot of manual work, but probably the best you can do...
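One way to do that check, as a rough sketch: feed every address through the candidate pattern and collect the ones that fail to match. The class name RegexCoverageCheck, the unmatched helper, the candidate pattern and the sample data are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class RegexCoverageCheck {
    /** Returns the inputs that the pattern fails to match in full. */
    static List<String> unmatched(Pattern pattern, List<String> addresses) {
        List<String> failures = new ArrayList<>();
        for (String address : addresses) {
            if (!pattern.matcher(address.trim()).matches()) {
                failures.add(address);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Hypothetical candidate pattern and sample data; substitute your own.
        Pattern candidate = Pattern.compile("([^,]+), ([^,]+), ([A-Z]{2,3}) (\\d{4})");
        List<String> sample = Arrays.asList(
                "3A/45 Jindabyne Rd, Oakleigh, VIC 3166",
                "Richmond 3121 VIC");
        List<String> failures = unmatched(candidate, sample);
        System.out.println(failures.size() + " of " + sample.size() + " did not match:");
        for (String failure : failures) {
            System.out.println("  " + failure);
        }
    }
}
```

This only catches inputs the regex rejects outright. Addresses that match but are split into the wrong parts still need a manual spot check.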
Edit: BTW, to use regexes in Java, use the classes from package java.util.regex. Just thought I'd mention it...
If anyone is interested, I wrote the following regex to parse Australian addresses.
And this one to parse New Zealand addresses.
I have created a regex that extracts the address components (e.g. unit number, street number, street name including the suburb, state and postcode). It works on Australian addresses but can easily be customized for other countries; the only part that needs updating is the state part.
https://regex101.com/library/5bj4wi
You could use String.split, first with "," and then with "." or "/".
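A minimal sketch of the split approach, assuming the strict format from the question. The class name SplitAddressParser is hypothetical, and the second-level delimiters used here (the first space, and "/") are my assumption about what suits this format, not something the answer spells out.

```java
import java.util.Arrays;

final class SplitAddressParser {
    /** Returns {unit, number, street, suburb, state, postcode}; unit is null when absent. */
    static String[] parse(String address) {
        // First level: "street part, suburb, STATE postcode".
        String[] byComma = address.split(",\\s*");
        if (byComma.length != 3) {
            throw new IllegalArgumentException("Expected 'street, suburb, STATE postcode': " + address);
        }
        // "3A/45 Jindabyne Rd" -> "3A/45" and "Jindabyne Rd" (split on the first space only).
        String[] streetParts = byComma[0].trim().split(" ", 2);
        // "VIC 3166" -> "VIC" and "3166".
        String[] statePostcode = byComma[2].trim().split("\\s+");
        if (streetParts.length != 2 || statePostcode.length != 2) {
            throw new IllegalArgumentException("Unexpected address format: " + address);
        }
        // "3A/45" -> unit "3A" and number "45"; no slash means no unit.
        String[] unitNumber = streetParts[0].split("/", 2);
        String unit = unitNumber.length == 2 ? unitNumber[0] : null;
        String number = unitNumber[unitNumber.length - 1];
        return new String[] {
            unit, number, streetParts[1], byComma[1].trim(), statePostcode[0], statePostcode[1]
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parse("3A/45 Jindabyne Rd, Oakleigh, VIC 3166")));
    }
}
```

Limiting the street split to two pieces keeps multi-word street names such as "Jindabyne East Rd" intact.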
For a commercial solution, you could give address-parser.com a try.