Java: Parsing Australian street addresses

Posted on 2024-08-23 11:57:52

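I'm looking for a quick and dirty way to parse Australian street addresses into their component parts:
3A/45 Jindabyne Rd, Oakleigh, VIC 3166

should be split into:
"3A", 45, "Jindabyne Rd", "Oakleigh", "VIC", 3166

Suburb names can have multiple words, as can street names.

See: Parse a Street Address into Components

It has to be in Java, and it cannot make HTTP requests (e.g. to web APIs).

EDIT: Assume the specified format is always followed. I don't mind spitting incorrectly formatted strings back at the user with a message telling them to follow the format (which I've described above).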

分开我的手 2024-08-30 11:57:52

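Honestly, you've set yourself a fairly Sisyphean challenge here, and I'm not sure it's worth it. Unless your data comes from a known source with a very well-defined format, you're going to end up with completely useless data. If you're dealing with free text, people will screw up their addresses in ways you wouldn't believe.

Do you really want to try to parse (yourself) every possible combination of Richmond, Victoria, 3121 and Richmond 3121 VIC and Richmond VIC, 3121, etc.? And that's just at suburb granularity!

Addresses are worse. Sure, most people will write 7/21 Smith St for a unit, or 29-33 Jones St for a location spanning several street numbers, but people aren't consistent. Is 1-5 Brown St unit 1 at number 5, or a location running from number 1 to number 5 on that street? Is 7A a separate subdivided street address, or unit A at #7?

Address matching is not a simple problem. If your data set is free text entered by end users, I really wouldn't bother unless your data volume is small, you don't care much about accuracy, or you have plenty of time for manual cleanup. Otherwise, hand it off to software that can do the job for you.

Australia Post has a Postal Address File (PAF), which contains every valid delivery point in Australia. A number of software libraries will do the parsing and matching for you, and either give you a definitive answer (including all the individual address components you're after) or offer a list of potential matches to choose from if the address doesn't exist or is ambiguous. One example I know of is QAS Batch (no affiliation with them; I evaluated their software in the past but ended up not using it), but that's just one example. A list of others is available via the PAF website.

I'd strongly recommend not wasting time on this unless the scale is small.

If it is, then hey, yeah, regex.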

难如初 2024-08-30 11:57:52

Given your reply to my other answer, this should do for the strictly-formatted case you specify:

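    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    String sample = "3A/45 Jindabyne Rd, Oakleigh, VIC 3166";
    // Groups: 2 = unit (optional), 3 = number, 4 = street, 5 = suburb, 6 = state, 7 = postcode
    Pattern pattern = Pattern.compile("(([^/ ]+)/)?([^ ]+) ([^,]+), ([^,]+), ([^ ]+) (\\d+)");
    Matcher m = pattern.matcher(sample);
    // matches() requires the whole string to fit the format; find() would also accept a partial match
    if (m.matches()) {
        System.out.println("Unit: " + m.group(2));
        System.out.println("Number: " + m.group(3));
        System.out.println("Street: " + m.group(4));
        System.out.println("Suburb: " + m.group(5));
        System.out.println("State: " + m.group(6));
        System.out.println("Postcode: " + m.group(7));
    } else {
        throw new IllegalArgumentException("Address does not follow the expected format: " + sample);
    }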

This works if you remove the '3A/' (in which case m.group(2) will be null), if the street number is '45A' or '45-47', if we add a space to the road ('Jindabyne East Rd') or to the suburb ('Oakleigh South').
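To try those variations quickly, here is a minimal sketch that wraps the same pattern in a helper. The class name `StrictAddressParser`, the `parse` method and the array layout are illustrative assumptions, not part of the answer's code. It uses `matches()` so that input with leading or trailing junk is rejected rather than partially matched.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StrictAddressParser {
    // Same pattern as in the answer, compiled once and reused.
    private static final Pattern ADDRESS = Pattern.compile(
            "(([^/ ]+)/)?([^ ]+) ([^,]+), ([^,]+), ([^ ]+) (\\d+)");

    /**
     * Returns {unit, number, street, suburb, state, postcode}.
     * The unit is null when the address has no "unit/" prefix.
     */
    public static String[] parse(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                    "Expected format: [unit/]number street, suburb, STATE postcode");
        }
        return new String[] {
            m.group(2), m.group(3), m.group(4), m.group(5), m.group(6), m.group(7)
        };
    }

    public static void main(String[] args) {
        // The variations mentioned in the answer: no unit, lettered or ranged
        // street numbers, and multi-word street and suburb names.
        String[] samples = {
            "3A/45 Jindabyne Rd, Oakleigh, VIC 3166",
            "45 Jindabyne Rd, Oakleigh, VIC 3166",
            "45A Jindabyne Rd, Oakleigh, VIC 3166",
            "45-47 Jindabyne East Rd, Oakleigh South, VIC 3166"
        };
        for (String s : samples) {
            System.out.println(Arrays.toString(parse(s)));
        }
    }
}
```

Returning a plain array keeps the sketch short; a small value class with named fields would be the more natural choice in real code.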

Just to explain that regex further, if you're not familiar with regular expressions:

(([^/ ]+)/)? is the equivalent of just ([^/ ]+/)? -- that is, 'anything not including a forward slash or a space, followed by a slash'. The question mark makes it optional (so the whole clause can be missing), and the extra parentheses in the final version are to create a smaller inner group, without the slash, for later extraction.

([^ ]+) is 'capture anything that's not a space (which is followed by a space)' -- this is the street number.

([^,]+), is 'capture anything that's not a comma (which is followed by comma and space)' -- this is the street name. Anything is valid in the street name as long as it's not a comma.

([^,]+), is the same again, in this case to capture the suburb.

([^ ]+) captures the next non-space string (state abbreviation) and skips the space after it.

(\\d+) rounds off by capturing one or more digits (the postcode).
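The pieces walked through above can also be written with named capturing groups, which java.util.regex supports via `(?<name>...)`. This is only a sketch of the same pattern split across lines. The group names and the class name `NamedGroupAddress` are my own choices for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedGroupAddress {
    private static final Pattern ADDRESS = Pattern.compile(
            "((?<unit>[^/ ]+)/)?"      // optional "unit/" prefix; the slash stays outside the named group
            + "(?<number>[^ ]+) "      // street number, up to the first space
            + "(?<street>[^,]+), "     // street name, up to the first comma
            + "(?<suburb>[^,]+), "     // suburb, up to the next comma
            + "(?<state>[^ ]+) "       // state abbreviation, up to the next space
            + "(?<postcode>\\d+)");    // one or more digits

    /** Returns a matcher positioned on a successful whole-string match. */
    public static Matcher match(String address) {
        Matcher m = ADDRESS.matcher(address);
        if (!m.matches()) {
            throw new IllegalArgumentException(
                    "Address does not follow the expected format: " + address);
        }
        return m;
    }

    public static void main(String[] args) {
        Matcher m = match("3A/45 Jindabyne Rd, Oakleigh, VIC 3166");
        System.out.println("Unit: " + m.group("unit"));
        System.out.println("Number: " + m.group("number"));
        System.out.println("Street: " + m.group("street"));
        System.out.println("Suburb: " + m.group("suburb"));
        System.out.println("State: " + m.group("state"));
        System.out.println("Postcode: " + m.group("postcode"));
    }
}
```

Named groups avoid having to count parentheses when the pattern changes; `m.group("unit")` is still null when the unit prefix is absent.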

Hope that's helpful.

莳間冲淡了誓言ζ 2024-08-30 11:57:52

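Well, this could be quite hard, because the format isn't clearly defined.

A regular expression will certainly do as a quick and dirty solution. The problem is that it may fail (produce incorrect results) in special cases.

Your best bet is probably to hack together a small regex, run it against a real data set (ideally all the data you have in production), and check whether it gives good results. That may take a lot of manual work, but it's probably the best you can do...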
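As a rough sketch of that workflow, the snippet below runs a candidate pattern over a list of addresses and collects the ones it rejects, so they can be reviewed by hand. The pattern is the strict-format one from another answer on this page, used here only as a stand-in. The class name `AddressAudit`, the method `findRejects` and the sample data are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class AddressAudit {
    // Candidate pattern under test; swap in whatever regex you are evaluating.
    private static final Pattern ADDRESS = Pattern.compile(
            "(([^/ ]+)/)?([^ ]+) ([^,]+), ([^,]+), ([^ ]+) (\\d+)");

    /** Returns the addresses that do not fit the pattern, for manual review. */
    public static List<String> findRejects(List<String> addresses) {
        List<String> rejects = new ArrayList<>();
        for (String address : addresses) {
            if (!ADDRESS.matcher(address.trim()).matches()) {
                rejects.add(address);
            }
        }
        return rejects;
    }

    public static void main(String[] args) {
        // Made-up sample data; in practice this would be loaded from production records.
        List<String> data = Arrays.asList(
                "3A/45 Jindabyne Rd, Oakleigh, VIC 3166",
                "Richmond 3121 VIC",
                "Richmond VIC, 3121");
        List<String> rejects = findRejects(data);
        System.out.println(rejects.size() + " of " + data.size() + " rejected:");
        for (String r : rejects) {
            System.out.println("  " + r);
        }
    }
}
```

This only shows which inputs fail to match. Addresses that match but are split incorrectly still need spot checks on the captured groups.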

EDIT: By the way, to use regular expressions in Java, use the package java.util.regex. Just thought I'd mention it...

手长情犹 2024-08-30 11:57:52

If anyone is interested, I wrote the following regex to parse Australian addresses.

r"(?i)(\b(PO BOX|post box)[,\s|.\s|,.|\s]*)?(\b(\d+))(\b(?:(?!\s{2,}).){1,60})\b(New South Wales|Victoria|Queensland|Western Australia|South Australia|Tasmania|VIC|NSW|ACT|QLD|NT|SA|TAS|WA).?[,\s|.\s|,.|\s]*(\b\d{4}).?[,\s|.\s|,.|\s]*(\b(Australia|Au))?"

And this one parses New Zealand addresses.

r"(?i)(\b(PO BOX|post box)[,\s|.\s|,.|\s]*)?(\b(\d+))(\b(?:(?!\s{2,}).){1,60})\b(Northland|Auckland|Waikato|Bay of Plenty|Gisborne|Hawke's Bay|Taranaki|Manawatu-Whanganui|Wellington|Tasman|Nelson|Marlborough|West Coast|Canterbury|Otago|Southland).?[,\s|.\s|,.|\s]*(\b\d{4}).?[,\s|.\s|,.|\s]*(\b(New zealand|Newzealand|Nz))?"
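The patterns above are written with an `r"..."` prefix, which looks like Python raw-string notation. Since the question asks for Java, here is a hedged sketch of how the Australian pattern might be carried over to java.util.regex: the only change is doubling each backslash inside the Java string literal. The class name `AuRegexPort`, the `extract` method and the choice of which groups to return are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AuRegexPort {
    // Same pattern as the Australian regex above, with backslashes doubled for Java.
    // Group 4 = leading number, group 6 = state, group 7 = four-digit postcode.
    private static final Pattern AU_ADDRESS = Pattern.compile(
            "(?i)(\\b(PO BOX|post box)[,\\s|.\\s|,.|\\s]*)?(\\b(\\d+))"
            + "(\\b(?:(?!\\s{2,}).){1,60})"
            + "\\b(New South Wales|Victoria|Queensland|Western Australia|South Australia"
            + "|Tasmania|VIC|NSW|ACT|QLD|NT|SA|TAS|WA)"
            + ".?[,\\s|.\\s|,.|\\s]*(\\b\\d{4}).?[,\\s|.\\s|,.|\\s]*(\\b(Australia|Au))?");

    /** Returns {number, state, postcode}, or null when nothing in the text matches. */
    public static String[] extract(String text) {
        Matcher m = AU_ADDRESS.matcher(text);
        // find() rather than matches(): this pattern is written to locate an
        // address inside free text, not to validate a whole string.
        if (!m.find()) {
            return null;
        }
        return new String[] {m.group(4), m.group(6), m.group(7)};
    }

    public static void main(String[] args) {
        String[] parts = extract("3A/45 Jindabyne Rd, Oakleigh, VIC 3166");
        System.out.println("Number: " + parts[0]);
        System.out.println("State: " + parts[1]);
        System.out.println("Postcode: " + parts[2]);
    }
}
```

Note that this pattern does not capture a unit prefix such as "3A/" separately, and the street and suburb come back together in group 5, so it is looser than the strict-format approach discussed elsewhere in this thread.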
