Splitting a list of names into "FirstName {TAB} Lastname" pairs

Posted on 2024-12-24 03:00:33

Is there a specific library, algorithm or technique (besides regular expressions)
for converting/translating the following lines?

"Acme Corporation Inc., John, Doe, F."
"Smith, Allen, Smith,Susan"
"Marshall, J., L., Johnson, H., Caruso, D., Jones, J."
"Stein, Harry, Joan, and Mike"

These lines should be converted into text containing:

Acme {TAB} Corporation
Doe {TAB} John
Smith {TAB} Allen
Smith {TAB} Susan
Marshall {TAB} J.
Johnson {TAB} H.
Caruso {TAB} D.
Jones {TAB} J.
Stein {TAB} Harry
Stein {TAB} Joan
Stein {TAB} Mike

The original text contains only proper names and middle initials (D. or J.) except for
an occasional "and" separating siblings with the same last name as in the last line
of original text above.

Also, is this considered to be "Named Entity Recognition" or is there some other technical
name for this process?

Ideally, I would like code or algorithms in a language like Ruby/Python/Perl/PHP that could
make this translation.

Any ideas? Thanks in advance.


淡淡绿茶香 2024-12-31 03:00:33

This works, almost:

#!/usr/bin/env perl
use strict;
use warnings;

my $tok = undef;
my @pairs = ();
my $looking_for = 'surname';

sub parse_line_to_words($){
    my $l = shift;
    my @words;
    my $word = '';
    my $start = 1;

    # remove trailing newlines
    chomp $l;
    if(length($l) && substr($l, -1) eq '"'){
            # remove trailing quotation mark.
            chop $l;
    }
    foreach my $c (split//,$l){
            if($c eq '"'){
                    if($#words == -1){
                            # skip leading quotation marks
                            next;
                    }
            }

            if($c eq ','){
                    push(@words, $word);
                    $word = '';
                    $start = 1;
            } else{
                    if($start && $c eq ' '){
                            next;
                    } else{
                            $start = 0;
                    }
                    $word .= $c;
            }
    }
    if($word ne ''){
            push(@words, $word);
    }
    return @words;
}
sub peek_and(@){
    foreach my $word (@_){
            return 1 if $word eq 'and'
    }
    return 0;
}
sub split_and(@){
    my @copy;
    foreach my $word (@_){
            if(index($word, 'and ', 0) == 0){
                    # token begins with 'and ': emit 'and', then the name after it
                    push(@copy, 'and');
                    push(@copy, substr($word, 4));
            } else{
                    push(@copy, $word);
            }
    }
    return @copy;
}
sub count_spaces($){
    my $w = shift;
    my $s=0;
    for(my $p = index($w, ' ', 0); $p != -1; $p=index($w, ' ', $p+1), $s++) {}
    return $s;
}
sub found($$$){
    my $pairs = shift;
    push(@{$pairs}, {'surname' => shift, 'firstname' => shift});
}
while(<>){
    chomp;
    my $line = $_;
    my @words = parse_line_to_words($line);
    @words = split_and(@words);
    my $line_has_and = peek_and(@words);
    foreach my $word (@words){
            my $spaces = count_spaces($word);

            if($looking_for eq 'surname'){
                    if($spaces == 0 && length($word) && substr($word, -1) eq '.'){
                            # looks like an initial to me, skip it
                    } else{
                            if($spaces > 0){
                                    # multi-word token; must be corporation name
                                    my($f, $l) = split(/ /, $word);
                                    found(\@pairs, $f, $l);
                            } else{
                                    $tok = $word;
                                    $looking_for = 'firstname';
                            }
                    }
            } elsif ($looking_for eq 'firstname'){
                    if($line_has_and){
                            # lastname, first1, ..., firstn and firstn+1
                            if($word ne 'and'){
                                    found(\@pairs, $tok, $word);
                            }
                    } else{
                            # lastname, f. or lastname, firstname
                            found(\@pairs, $tok, $word);
                            $looking_for = 'surname';
                    }
            }
    }
    $looking_for = 'surname'; # reset for new line
}

foreach my $p (@pairs){
    printf("%s\t%s\n", $p->{'surname'}, $p->{'firstname'});
}

Actual output for given sample input

Acme    Corporation
John    Doe
Smith   Allen
Smith   Susan
Marshall        J.
Johnson H.
Caruso  D.
Jones   J.
Stein   Harry
Stein   Joan
Stein   Mike

Discussion

I employed the following heuristics:

Leading and trailing quotation marks on a line should be ignored.
Each line can be tokenized into words as a series of comma-delimited values.
If a word begins with space characters then those characters should be ignored.
The first word of any pair of words is a surname, the second a first name (except special cases).
If a word on a line begins with 'and ', the whole line is treated specially: the first word is the surname and every remaining word is a first name that goes with it.
If a token in the surname position contains one or more spaces, it is taken to be the name of a corporation.
Only the first two space-delimited words of a corporation name are used; they are treated as surname and first name respectively.
Non-corporation names do not contain spaces.
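
For readers who prefer Python (one of the languages the question asks for), the heuristics above can be sketched roughly as follows. This is a loose restatement under the same assumptions, not a line-by-line port of the Perl script; the function names `tokenize` and `parse_line` are made up for this example, and it inherits the same "John Doe" problem.

```python
def tokenize(line):
    """Split one input line into comma-delimited tokens."""
    line = line.strip()
    # Heuristic: ignore leading and trailing quotation marks.
    if line.startswith('"'):
        line = line[1:]
    if line.endswith('"'):
        line = line[:-1]
    # Heuristic: comma-delimited values; surrounding spaces are dropped
    # and empty tokens are skipped (an assumption of this sketch).
    return [tok.strip() for tok in line.split(',') if tok.strip()]


def parse_line(line):
    """Return a list of (surname, firstname) pairs found on one line."""
    tokens = []
    for tok in tokenize(line):
        # A token that begins with 'and ' becomes 'and' plus the name.
        if tok.startswith('and '):
            tokens.extend(['and', tok[4:].strip()])
        else:
            tokens.append(tok)
    has_and = 'and' in tokens

    pairs = []
    surname = None  # None means "currently looking for a surname"
    for tok in tokens:
        if surname is None:
            if ' ' in tok:
                # Multi-word token: assumed to be a corporation name;
                # only its first two words are used.
                first, second = tok.split()[:2]
                pairs.append((first, second))
            elif tok.endswith('.'):
                continue  # looks like a stray initial, skip it
            else:
                surname = tok
        elif has_and:
            # surname, first1, ..., and firstN: the surname stays fixed.
            if tok != 'and':
                pairs.append((surname, tok))
        else:
            pairs.append((surname, tok))
            surname = None
    return pairs


if __name__ == "__main__":
    for pair in parse_line('"Stein, Harry, Joan, and Mike"'):
        print("\t".join(pair))
```

Keeping the state in a single `surname` variable that is reset for every line plays the role of `$looking_for` and `$tok` in the Perl version.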

In the end I used "regular expressions" only to split corporation names on space; this could be trivially replaced with a non-regex version.
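
As an illustration of what a regex-free split could look like, here is a small Python sketch that only uses `str.find` and slicing. The helper name `first_two_words` is hypothetical, and treating a run of several spaces as one separator is an assumption of this sketch rather than something the Perl script does.

```python
def first_two_words(name):
    """Return the first two space-delimited words of name without regexes."""
    name = name.strip(' ')
    first_end = name.find(' ')
    if first_end == -1:
        # Only one word present: there is no second word to return.
        return name, ''
    # Skip the run of spaces that separates the first and second word.
    second_start = first_end
    while second_start < len(name) and name[second_start] == ' ':
        second_start += 1
    second_end = name.find(' ', second_start)
    if second_end == -1:
        second_end = len(name)
    return name[:first_end], name[second_start:second_end]
```

For the sample token `Acme Corporation Inc.` this yields `Acme` and `Corporation`, and the trailing `Inc.` is ignored.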

Even with all of this I still get "John Doe" wrong, because its names are reversed in the input. I couldn't devise a reliable way to detect this.
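
One possible, and admittedly unreliable, way to catch such reversals is to check each pair against a list of known given names and swap when only the surname slot matches. The Python sketch below assumes such a list exists; `KNOWN_GIVEN_NAMES` and `maybe_swap` are hypothetical names, and the tiny set is there purely for illustration.

```python
# Hypothetical, deliberately tiny lookup set; a usable one would have to be
# far larger and would still be ambiguous for names like "Allen" or "Marshall".
KNOWN_GIVEN_NAMES = {"john", "allen", "susan", "harry", "joan", "mike"}


def maybe_swap(surname, firstname, given_names=KNOWN_GIVEN_NAMES):
    """Swap the pair when only the surname slot looks like a given name."""
    surname_slot_is_given = surname.lower() in given_names
    firstname_slot_is_given = firstname.lower() in given_names
    if surname_slot_is_given and not firstname_slot_is_given:
        return firstname, surname
    # Either nothing matched or both slots matched: leave the pair alone.
    return surname, firstname
```

With this set, `maybe_swap('John', 'Doe')` returns the pair as `Doe`, `John`, while `Smith`, `Allen` and `Marshall`, `J.` are left unchanged. Names that serve as both given names and surnames will still defeat it, so it narrows the problem rather than solving it.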
