Parsing addresses with a regular expression

Posted on 2024-08-22 05:05:58

I have to write a loop that uses a regexp to populate
any of these four variables:

$address, $street, $town, $lot

The loop will be fed strings that may contain information
like the lines below:

'123 any street, mytown' or
'Lot 4 another road, thattown' or
'Lot 2 96 other road, her town' or
'this ave, this town' or
'yourtown'

Since anything after a comma is the $town, I thought of starting with

(.*), (.*)

Then the first capture could be checked with (Lot \d*) (.*), (.*)
If the first capture starts with a number, then it is the $address (if it is words separated by whitespace, it is the $street).
If there is only one word, it is just the $town.

苄①跕圉湢 2024-08-29 05:05:58

Take a look at Geo::StreetAddress::US if these are U.S. addresses.

Even if they are not, the source of this module should give you an idea of what is involved in parsing free-form street addresses.

Here is a script that handles the addresses you posted (updated: the earlier version combined lot and number into one string):

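#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

# Paragraph mode: each blank-line-separated record in DATA is one address.
local $/ = "";

my @addresses;

while ( my $address = <DATA> ) {
    chomp $address;
    $address =~ s/\s+/ /g;    # collapse newlines and runs of whitespace

    my (%address, $rest);

    # Split once on the LAST comma by reversing the string first: the town
    # is whatever follows that comma, or the whole string if there is none.
    ($address{town}, $rest) = map { scalar reverse }
                        split( / ?, ?/, reverse($address), 2 );

    {
        # $rest is undef when there is no comma; the match then fails and
        # lot, number and street are left undefined.
        no warnings 'uninitialized';
        @address{qw(lot number street)} =
            $rest =~ /^(?:(Lot [0-9]+) )?(?:([0-9]+) )?(.+)\z/;
    }
    push @addresses, \%address;
}

print Dumper \@addresses;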

__DATA__
123 any street,
mytown

Lot 4 another road,
thattown

Lot 2 96 other road,
her town

yourtown

street,
town

Output:

$VAR1 = [
          {
            'lot' => undef,
            'number' => '123',
            'street' => 'any street',
            'town' => 'mytown'
          },
          {
            'lot' => 'Lot 4',
            'number' => undef,
            'street' => 'another road',
            'town' => 'thattown'
          },
          {
            'lot' => 'Lot 2',
            'number' => '96',
            'street' => 'other road',
            'town' => 'her town'
          },
          {
            'lot' => undef,
            'number' => undef,
            'street' => undef,
            'town' => 'yourtown'
          },
          {
            'lot' => undef,
            'number' => undef,
            'street' => 'street',
            'town' => 'town'
          }
        ];

一桥轻雨一伞开 2024-08-29 05:05:58

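I would recommend not trying to do all of this in a single regex, because it is hard to verify that it is correct.

First, I would split at the comma. Whatever comes after the comma is the $town; if there is no comma, the whole string is the $town.

Then I would check for any lot information and extract it from the string.

Then I would look for the street/avenue number and name.

Divide and conquer :)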
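
The steps above could be sketched in Perl roughly as follows. This is only an illustration, not a tested parser: the subroutine name parse_address and the hash keys are made up here, and the "Lot <digits>" format is assumed from the examples in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: splits one free-form line into lot, number,
# street and town, one step at a time.
sub parse_address {
    my ($line) = @_;
    my %parts = ( lot => undef, number => undef, street => undef, town => undef );

    # Step 1: everything after the last comma is the town;
    # without a comma the whole string is the town.
    my $rest;
    if ( $line =~ /^(.*),\s*(.*)$/ ) {
        ( $rest, $parts{town} ) = ( $1, $2 );
    }
    else {
        $parts{town} = $line;
        return \%parts;
    }

    # Step 2: remove an optional leading "Lot <digits>".
    if ( $rest =~ s/^(Lot\s+\d+)\s*// ) {
        $parts{lot} = $1;
    }

    # Step 3: remove an optional leading street number.
    if ( $rest =~ s/^(\d+)\s+// ) {
        $parts{number} = $1;
    }

    # Step 4: whatever is left is the street name.
    $parts{street} = $rest if length $rest;

    return \%parts;
}

for my $line ( '123 any street, mytown', 'Lot 2 96 other road, her town', 'yourtown' ) {
    my $p = parse_address($line);
    print join( ' | ', map { "$_=" . ( $p->{$_} // '' ) } qw(lot number street town) ), "\n";
}
```

Because each step is its own small pattern, it can be checked against the sample lines independently of the others.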

白云不回头 2024-08-29 05:05:58

This should separate the string into three parts. How do you distinguish the address from the street?

(Lot \d*)? ?([^,]*,)? ?(.*)

Here is the breakdown for your examples:

('', '123 any street,', 'mytown')
('Lot 4', 'another road,', 'thattown')
('Lot 2', '96 other road,', 'her town')
('', 'this ave,', 'this town')
('', '', 'yourtown')

If I understand correctly, this one separates the address from the street as well:

(Lot \d*)? ?(\d*) ?([^,]*,)? ?(.*)

('', '123', 'any street,', 'mytown')
('Lot 4', '', 'another road,', 'thattown')
('Lot 2', '96', 'other road,', 'her town')
('', '', 'this ave,', 'this town')
('', '', '', 'yourtown')
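
To check a breakdown like this yourself, a short Perl loop can print the captures for each sample line. This is a sketch only: the pattern is the second one above, copied unchanged, while the helper name captures and the choice to print unmatched groups as empty strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Second pattern from the answer above, unchanged.
my $re = qr/(Lot \d*)? ?(\d*) ?([^,]*,)? ?(.*)/;

# Hypothetical helper: returns the four captures for one line,
# with unmatched optional groups turned into empty strings.
sub captures {
    my ($line) = @_;
    my @groups = $line =~ $re;
    return map { defined $_ ? $_ : '' } @groups;
}

my @samples = (
    '123 any street, mytown',
    'Lot 4 another road, thattown',
    'Lot 2 96 other road, her town',
    'this ave, this town',
    'yourtown',
);

for my $line (@samples) {
    print join( ' | ', map { "'$_'" } captures($line) ), "\n";
}
```

Note that the third capture keeps its trailing comma, as in the breakdown above, so it would need to be stripped before use as a street name.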

丘比特射中我 2024-08-29 05:05:58

I can't match the last one, but for the first three you can use something like this:

if (preg_match('/(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)/m', $subject, $regs)) {
    $result = $regs[1];
} else {
    $result = "";
}

This is the regex being tested:

(?:Lot (\d*)|)(?: |)(?:(\d*)|) (.*), (.*)

You can use this in regexbuddy to test: link

沐歌 2024-08-29 05:05:58

Geo::StreetAddress::US is fine for simple addresses, but it can lose context on harder examples. It parses the street name until it finds a suburb. So with "46 7th St. Johns Park", 'St.' is consumed too soon, the street type gets incorrectly assigned to 'Park', and the state 'CA' becomes the suburb. In the listing below, each row shows the (truncated) input followed by the parsed number, street, street type, suburb, state and ZIP code.

2 Smith St Suburb NJ 12345              2 Smith           St   Suburb          NJ 12345
25 MIRROR LAKE DR LITTLE EGG HARBOR    25 MIRROR LAKE DR  Hbr  NJ                     0
74B Old Bohema Rd N, St. Johns Park    74 B Old Bohema    Rd   St Johns Park   CA 95472
74 Mt Baw Baw Rd Suite C Some Park C   74 Mt Baw Baw Rd S Park CA                     0
74 Old Bohema Rd Bldg A Some Park CA   74 Old Bohema Rd B Park CA                     0
74 Old Bohema Rd Rm 123A Some Park C   74 Old Bohema Rd R Park CA                     0
Lot 74 Old Bohema Rd Some Park CA 95    0 Old Bohema Rd S Park CA                     0
22 Glen Alpine Way Some Park CA 9547   22 Glen Alpine Way Park CA                     0
4/6 Bohema Rd, St. Johns Park CA 954    4 6 Bohema        Rd   St Johns Park   CA 95472
46 The Parade, St. Johns Park CA 954   46 The                  Parade                 0
46 7th St. Johns Park CA 95472         46 7th St Johns    Park CA                     0
46 B Avenue Johns Park CA 95472        46 B Avenue Johns  Park CA                     0
46 Avenue C Johns Park CA 95472        46 Avenue C Johns  Park CA                     0
46 Broadway Johns Park CA 95472        46 Broadway Johns  Park CA                     0
46 State Route 19 Johns Park CA 9547   46 State Route 19  Park CA                     0
46 John F Kennedy Drive Johns Park C   46 John F Kennedy  Park CA                     0
PO Box 213 Somewhere IO 1234            0 Somewhere            IO                     0
1 BEACH DR SE # 2410 ST PETERSBURG F    1 BEACH DR SE # 2 St   PETERSBURG      FL 33701
# 123 12 BEACH DR SE ST PETERSBURG F   12 BEACH DR SE     St   PETERSBURG      FL 33701
46 Broad Street #12 Suburb CA 95472    46 Broad           St                          0

I have developed a Perl module that can identify many of these more difficult patterns: https://metacpan.org/release/Lingua-EN-AddressParse . It recognizes idioms such as "The Parade" and nth Street, sub-property addresses such as "46 Broad Street #12", and many more.
