Regular expression for matching UK postcodes

I recently posted an answer to this question on UK postcodes for the R language. I discovered that the UK Government's regex pattern is incorrect and fails to properly validate some postcodes. Unfortunately, many of the answers here are based on this incorrect pattern.

I'll outline some of these issues below and provide a revised regular expression that actually works.

Note

My answer (and regular expressions in general):

Only validates postcode formats.
Does not ensure that a postcode legitimately exists.
- For this, use an appropriate API! See Ben's answer for more info.

If you don't care about the bad regex and just want to skip to the answer, scroll down to the Answer section.

The Bad Regex

The regular expressions in this section should not be used.

This is the failing regex that the UK government has provided developers (not sure how long this link will be up, but you can see it in their Bulk Data Transfer documentation):

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))[0-9][A-Za-z]{2})$

Problems

Problem 1 - Copy/Paste

Many developers copy/paste code (especially regular expressions) and expect it to work. While this is great in theory, it fails in this particular case because copy/pasting from this document actually changes one of the characters (a space) into a newline character, as shown below:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))
[0-9][A-Za-z]{2})$

The first thing most developers will do is just erase the newline without thinking twice. Now the regex won't match postcodes with spaces in them (other than the GIR 0AA postcode).

To fix this issue, the newline character should be replaced with the space character:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                                                                                     ^
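The effect of deleting the newline instead of restoring the space can be checked with a short script. This is a minimal sketch in Python (not the language of the original document); the variable names and the sample postcodes are illustrative assumptions, and the pattern is rebuilt from parts purely for readability.

```python
import re

# Parts of the published pattern, copied as-is (remaining bugs included).
GIR = r"([Gg][Ii][Rr] 0[Aa]{2})"
OUTWARD = (
    r"(([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})"
    r"|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))"
)
INWARD = r"[0-9][A-Za-z]{2}"

# Newline simply deleted: nothing is left between outward and inward code.
newline_deleted = re.compile("^" + GIR + "|(" + OUTWARD + INWARD + ")$")
# Newline replaced with the space that was there originally.
with_space = re.compile("^" + GIR + "|(" + OUTWARD + " " + INWARD + ")$")

for postcode in ["A11 1AA", "A111AA", "GIR 0AA"]:
    print(
        postcode,
        "| newline deleted:", bool(newline_deleted.search(postcode)),
        "| space restored:", bool(with_space.search(postcode)),
    )
```

With the newline deleted, only the unspaced form and GIR 0AA get through, which matches the behaviour described above.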

Problem 2 - Boundaries

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
^^                     ^ ^                                                                                                                                            ^^

The postcode regex is improperly anchored. Anyone using this regex to validate postcodes might be surprised if a value like fooA11 1AA gets through. That's because they've anchored the start of the first option and the end of the second option (independently of one another), as pointed out in the regex above.

What this means is that ^ (asserts position at start of the line) only works on the first option ([Gg][Ii][Rr] 0[Aa]{2}), so the second option will validate any strings that end in a postcode (regardless of what comes before).

Similarly, the first option isn't anchored to the end of the line $, so GIR 0AAfoo is also accepted.

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))[0-9][A-Za-z]{2})$

To fix this issue, both options should be wrapped in another group (or non-capturing group) and the anchors placed around that:

^(([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2}))$
^^                                                                                                                                                                      ^^
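The anchoring problem is easy to reproduce. The sketch below uses Python and assumes the space from Problem 1 has already been restored; the names `original` and `anchored` are made up for illustration, and the other bugs in the pattern are deliberately left untouched.

```python
import re

GIR = r"([Gg][Ii][Rr] 0[Aa]{2})"
REST = (
    r"((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})"
    r"|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))"
    r" [0-9][A-Za-z]{2})"
)

# As published: ^ binds only to the first option, $ only to the second.
original = re.compile("^" + GIR + "|" + REST + "$")
# Both options wrapped in one group so the anchors apply to each of them.
anchored = re.compile("^(" + GIR + "|" + REST + ")$")

for value in ["fooA11 1AA", "GIR 0AAfoo", "A11 1AA", "GIR 0AA"]:
    print(
        repr(value),
        "| original:", bool(original.search(value)),
        "| anchored:", bool(anchored.search(value)),
    )
```

The two junk values are accepted by the original pattern and rejected once the anchors surround the whole alternation, while the two well-formed values pass in both cases.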

Problem 3 - Improper Character Set

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                       ^^

The regex is missing a - here to indicate a range of characters. As it stands, if a postcode is in the format ANA NAA (where A represents a letter and N represents a number), and it begins with anything other than A or Z, it will fail.

That means it will match A1A 1AA and Z1A 1AA, but not B1A 1AA.

To fix this issue, the character - should be placed between the A and Z in the respective character set:

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([A-Za-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                        ^
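A quick way to see the missing hyphen in action is to run the three example postcodes through both versions. This is an illustrative Python sketch; `BUGGY` and `fixed` are assumed names, and only the character set fix from this problem is applied.

```python
import re

# Published pattern with the space restored; [AZa-z] is the set in question.
BUGGY = (
    r"^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})"
    r"|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})"
    r"|(([AZa-z][0-9][A-Za-z])"
    r"|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))"
    r" [0-9][A-Za-z]{2})$"
)

buggy = re.compile(BUGGY)
# Apply only this fix: turn the literal "A", "Z" into the range "A-Z".
fixed = re.compile(BUGGY.replace("[AZa-z]", "[A-Za-z]"))

# Uppercase letters other than A and Z fail in the ANA NAA format.
# Lowercase input still passes, because a-z is a proper range in that set.
for postcode in ["A1A 1AA", "Z1A 1AA", "B1A 1AA", "b1a 1aa"]:
    print(
        postcode,
        "| buggy:", bool(buggy.search(postcode)),
        "| fixed:", bool(fixed.search(postcode)),
    )
```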

Problem 4 - Wrong Optional Character Set

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z])))) [0-9][A-Za-z]{2})$
                                                                                                                                        ^

I swear they didn't even test this thing before publicizing it on the web. They made the wrong character set optional: [0-9] in the fourth sub-option of option 2 (group 9). This allows the regex to match incorrectly formatted postcodes like AAA 1AA.

To fix this issue, make the next character class optional instead (and subsequently make the set [0-9] match exactly once):

^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})|(([AZa-z][0-9][A-Za-z])|([A-Za-z][A-Ha-hJ-Yj-y][0-9][A-Za-z]?)))) [0-9][A-Za-z]{2})$
                                                                                                                                                ^
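The difference between the two quantifier placements can be shown with a couple of inputs. A minimal Python sketch follows, with assumed names (`BUGGY`, `fixed`); it applies only the fix from this problem and leaves the other issues in place.

```python
import re

# Published pattern with the space restored; group 9 has "[0-9]?" optional.
BUGGY = (
    r"^([Gg][Ii][Rr] 0[Aa]{2})|((([A-Za-z][0-9]{1,2})"
    r"|(([A-Za-z][A-Ha-hJ-Yj-y][0-9]{1,2})"
    r"|(([AZa-z][0-9][A-Za-z])"
    r"|([A-Za-z][A-Ha-hJ-Yj-y][0-9]?[A-Za-z]))))"
    r" [0-9][A-Za-z]{2})$"
)

buggy = re.compile(BUGGY)
# Move the "?" from the digit to the trailing letter in group 9.
fixed = re.compile(BUGGY.replace("[0-9]?[A-Za-z]", "[0-9][A-Za-z]?"))

for postcode in ["AAA 1AA", "EC1A 1BB", "AA1 1AA"]:
    print(
        postcode,
        "| buggy:", bool(buggy.search(postcode)),
        "| fixed:", bool(fixed.search(postcode)),
    )
```

Only the digit-less AAA 1AA changes from accepted to rejected; the AANA NAA and AAN NAA shapes keep matching.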

Problem 5 - Performance

Performance on this regex is extremely poor. First off, they placed the least likely option, the one matching GIR 0AA, at the beginning. How many users are likely to have this postcode compared with any other? Probably hardly any. This means that every time the regex is used, it must exhaust this option before proceeding to the next one. To see how performance is impacted, compare the number of steps the original regex took (35) with the same regex after flipping the options (22; see https://regex101.com/r/ajQHrd/6).

The second issue with performance is due to the way the entire regex is structured. There's no point backtracking over each option if one fails. The current structure can be greatly simplified. I provide a fix for this in the Answer section.

Problem 6 - Spaces

This may not be considered a problem, per se, but it does raise concern for most developers. The spaces in the regex are not optional, which means the users inputting their postcodes must place a space in the postcode. This is an easy fix by simply adding ? after the spaces to render them optional. See the Answer section for a fix.

Answer

1. Fixing the UK Government's Regex

Fixing all the issues outlined in the Problems section and simplifying the pattern yields the following, shorter, more concise pattern. We can also remove most of the groups since we're validating the postcode as a whole (not individual parts):

^([A-Za-z][A-Ha-hJ-Yj-y]?[0-9][A-Za-z0-9]? ?[0-9][A-Za-z]{2}|[Gg][Ii][Rr] ?0[Aa]{2})$

This can further be shortened by removing all of the ranges from one of the cases (upper or lower case) and using a case-insensitive flag. Note: Some languages don't have one, so use the longer one above. Each language implements the case-insensitivity flag differently.

^([A-Z][A-HJ-Y]?[0-9][A-Z0-9]? ?[0-9][A-Z]{2}|GIR ?0A{2})$

Shorter again, replacing [0-9] with \d (if your regex engine supports it):

^([A-Z][A-HJ-Y]?\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$
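As one example of how the case-insensitive flag is applied, here is a minimal Python sketch. The helper name `is_valid_format` is hypothetical, and two choices are my own assumptions rather than part of the pattern above: `re.ASCII` keeps `\d` and `[A-Z]` restricted to ASCII characters, and `fullmatch` is used so that a trailing newline is not accepted.

```python
import re

# Pattern from this section; IGNORECASE replaces the doubled-up ranges.
# re.ASCII is an added assumption: it keeps \d and [A-Z] to ASCII only.
POSTCODE = re.compile(
    r"^([A-Z][A-HJ-Y]?\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$",
    re.IGNORECASE | re.ASCII,
)


def is_valid_format(postcode: str) -> bool:
    """Return True if the string has a valid UK postcode format.

    This checks the format only; it cannot tell whether the postcode exists.
    """
    return POSTCODE.fullmatch(postcode) is not None


for candidate in ["EC1A 1BB", "ec1a1bb", "M1 1AE", "GIR 0AA",
                  "AAA 1AA", "fooA11 1AA", "GIR 0AAfoo"]:
    print(repr(candidate), is_valid_format(candidate))
```

The first four candidates pass and the last three are rejected, covering the spacing, anchoring, and optional-digit problems described earlier.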

2. Simplified Patterns

Without ensuring specific alphabetic characters, the following can be used (keep in mind the simplifications from 1. Fixing the UK Government's Regex have also been applied here):

^([A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$

And even further if you don't care about the special case GIR 0AA:

^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$
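To make the trade-off concrete, the sketch below compares the stricter pattern from section 1 with the loosest pattern above. It is written in Python with assumed names (`STRICT`, `LOOSE`); AI1 1AA is a made-up value chosen only because its second letter falls outside [A-HJ-Y].

```python
import re

FLAGS = re.IGNORECASE | re.ASCII  # re.ASCII is an assumption, see lead-in

# Stricter pattern from section 1 (restricted second letter, GIR 0AA allowed).
STRICT = re.compile(
    r"^([A-Z][A-HJ-Y]?\d[A-Z\d]? ?\d[A-Z]{2}|GIR ?0A{2})$", FLAGS
)
# Loosest pattern from this section (any one or two letters, no GIR 0AA).
LOOSE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]? ?\d[A-Z]{2}$", FLAGS)

for postcode in ["EC1A 1BB", "AI1 1AA", "GIR 0AA"]:
    print(
        postcode,
        "| strict:", bool(STRICT.fullmatch(postcode)),
        "| loose:", bool(LOOSE.fullmatch(postcode)),
    )
```

The loose pattern accepts the made-up AI1 1AA that the strict one rejects, and it drops GIR 0AA, so which one fits depends on how much format checking you actually want.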

3. Complicated Patterns

I would not suggest over-verification of a postcode, as new Areas, Districts and Sub-districts may appear at any point in time. What I would suggest, potentially, is adding support for edge cases. Some special cases exist and are outlined in this Wikipedia article.

Here are complex regexes that include the subsections of 3. (3.1, 3.2, 3.3).

In relation to the patterns in 1. Fixing the UK Government's Regex:

^(([A-Z][A-HJ-Y]?\d[A-Z\d]?|ASCN|STHL|TDCU|BBND|[BFS]IQQ|PCRN|TKCA) ?\d[A-Z]{2}|BFPO ?\d{1,4}|(KY\d|MSR|VG|AI)[ -]?\d{4}|[A-Z]{2} ?\d{2}|GE ?CX|GIR ?0A{2}|SAN ?TA1)$

And in relation to 2. Simplified Patterns:

^(([A-Z]{1,2}\d[A-Z\d]?|ASCN|STHL|TDCU|BBND|[BFS]IQQ|PCRN|TKCA) ?\d[A-Z]{2}|BFPO ?\d{1,4}|(KY\d|MSR|VG|AI)[ -]?\d{4}|[A-Z]{2} ?\d{2}|GE ?CX|GIR ?0A{2}|SAN ?TA1)$
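The combined patterns are long enough that a small test loop helps when adapting them. Below is a Python sketch using the first (stricter) combined pattern; the name `COMBINED` and the sample values are illustrative assumptions, and `re.ASCII` plus `fullmatch` are my own choices rather than something the pattern requires.

```python
import re

# Combined pattern related to section 1, split across lines for readability.
COMBINED = re.compile(
    r"^(([A-Z][A-HJ-Y]?\d[A-Z\d]?|ASCN|STHL|TDCU|BBND|[BFS]IQQ|PCRN|TKCA)"
    r" ?\d[A-Z]{2}"
    r"|BFPO ?\d{1,4}"
    r"|(KY\d|MSR|VG|AI)[ -]?\d{4}"
    r"|[A-Z]{2} ?\d{2}"
    r"|GE ?CX|GIR ?0A{2}|SAN ?TA1)$",
    re.IGNORECASE | re.ASCII,
)

accepted = ["EC1A 1BB", "ASCN 1ZZ", "GX11 1AA", "BFPO 1234",
            "KY1-1111", "AI-2640", "GE CX", "GIR 0AA", "SAN TA1"]
rejected = ["BFPO 12345", "AAA 1AA", "fooA11 1AA"]

for value in accepted + rejected:
    print(repr(value), bool(COMBINED.fullmatch(value)))
```

Every value in `accepted` matches and every value in `rejected` does not, so the two lists can double as a starting point for regression tests if the pattern is edited later.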

3.1 British Overseas Territories

The Wikipedia article currently states (some formats slightly simplified):

AI-1111: Anguilla
ASCN 1ZZ: Ascension Island
STHL 1ZZ: Saint Helena
TDCU 1ZZ: Tristan da Cunha
BBND 1ZZ: British Indian Ocean Territory
BIQQ 1ZZ: British Antarctic Territory
FIQQ 1ZZ: Falkland Islands
GX11 1ZZ: Gibraltar
PCRN 1ZZ: Pitcairn Islands
SIQQ 1ZZ: South Georgia and the South Sandwich Islands
TKCA 1ZZ: Turks and Caicos Islands
BFPO 11: Akrotiri and Dhekelia
ZZ 11 & GE CX: Bermuda (according to this document)
KY1-1111: Cayman Islands (according to this document)
VG1111: British Virgin Islands (according to this document)
MSR 1111: Montserrat (according to this document)

An all-encompassing regex to match only the British Overseas Territories might look like this:

^((ASCN|STHL|TDCU|BBND|[BFS]IQQ|GX\d{2}|PCRN|TKCA) ?\d[A-Z]{2}|(KY\d|MSR|VG|AI)[ -]?\d{4}|(BFPO|[A-Z]{2}) ?\d{2}|GE ?CX)$
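The territories-only pattern can be exercised the same way. This Python sketch uses an assumed name (`TERRITORIES`) and sample values that follow the formats listed above; the specific digits are placeholders, not real addresses.

```python
import re

# Territories-only pattern from above, split across lines for readability.
TERRITORIES = re.compile(
    r"^((ASCN|STHL|TDCU|BBND|[BFS]IQQ|GX\d{2}|PCRN|TKCA) ?\d[A-Z]{2}"
    r"|(KY\d|MSR|VG|AI)[ -]?\d{4}"
    r"|(BFPO|[A-Z]{2}) ?\d{2}"
    r"|GE ?CX)$",
    re.IGNORECASE | re.ASCII,  # re.ASCII is an added assumption
)

samples = ["FIQQ 1ZZ", "GX11 1AA", "VG1110", "MSR 1110", "KY1-1111",
           "BFPO 11", "GE CX", "EC1A 1BB", "BFPO 1234"]

for value in samples:
    print(repr(value), bool(TERRITORIES.fullmatch(value)))
```

An ordinary UK postcode such as EC1A 1BB is rejected, as intended. Note that this pattern only allows two digits after BFPO, so BFPO 1234 is rejected here even though the combined patterns earlier accept one to four digits.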

3.2 British Forces Post Office

Although they've recently been changed to BF# (where # represents a number) to better align with the British postcode system, they're considered optional alternative postcodes. These postcodes follow(ed) the format of BFPO, followed by 1-4 digits:

^BFPO ?\d{1,4}$

3.3 Santa?

There's another special case with Santa (as mentioned in other answers): SAN TA1 is a valid postcode. A regex for this is very simple:

^SAN ?TA1$