How to split Chinese characters one by one?

Posted 2024-08-30 16:52:25

If there is no special character (such as whitespace or ':') between the first name and the last name, how can I split the Chinese characters below?

use strict; 
use warnings; 
use Data::Dumper;  

my $fh = \*DATA;  
my $fname; # 小三; 
my $lname; # 张 ;
while(my $name = <$fh>)
{

    $name =~ ??? ;
    print "$fname\n";
    print $lname;

}

__DATA__  
张小三

Output

小三
张

[Update]

Windows XP, using ActivePerl 5.10.1.


Comments (3)

朕就是辣么酷 2024-09-06 16:52:25

You have problems because you neglect to decode binary data into Perl character strings on input and to encode Perl strings back into binary data on output. Regular expressions and their friend split only work properly on decoded Perl strings.

(?<=.) is a lookbehind that matches at any position preceded by a character; combined with the limit of 2 passed to split, the string is split exactly once, after the first character. As such, this program will not work correctly on compound family names (复姓); keep in mind that they are rare, but do exist. To always split a name correctly into family name and given name, you need a dictionary of family names.

Linux version:

use strict;
use warnings;
use Encode qw(decode encode);

while (my $full_name = <DATA>) {
    $full_name = decode('UTF-8', $full_name);
    chomp $full_name;
    my ($family_name, $given_name) = split(/(?<=.)/, $full_name, 2);
    print encode('UTF-8',
        sprintf('The full name is %s, the family name is %s, the given name is %s.', $full_name, $family_name, $given_name)
    );

}

__DATA__
张小三

Output:

The full name is 张小三, the family name is 张, the given name is 小三.

Windows version:

use strict;
use warnings;
use Encode qw(decode encode);
use Encode::HanExtra qw();

while (my $full_name = <DATA>) {
    $full_name = decode('GB18030', $full_name);
    chomp $full_name;
    my ($family_name, $given_name) = split(/(?<=.)/, $full_name, 2);
    print encode('GB18030',
        sprintf('The full name is %s, the family name is %s, the given name is %s.', $full_name, $family_name, $given_name)
    );

}

__DATA__
张小三

Output:

The full name is 张小三, the family name is 张, the given name is 小三.
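
The dictionary idea mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original answer: the sub name split_name and the four-entry list of compound family names are hypothetical, and a real application would need a much fuller dictionary. It also assumes the script itself is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 source text

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical, deliberately tiny list of compound family names;
# a real application would load a much fuller dictionary.
my %compound_surname = map { $_ => 1 } qw(欧阳 司马 诸葛 上官);

# Split a decoded full name into (family name, given name).
# Tries a two-character family name from the dictionary first,
# then falls back to the single-character assumption.
sub split_name {
    my ($full_name) = @_;
    if (length($full_name) > 2 && $compound_surname{ substr($full_name, 0, 2) }) {
        return (substr($full_name, 0, 2), substr($full_name, 2));
    }
    return (substr($full_name, 0, 1), substr($full_name, 1));
}

for my $full_name (qw(张小三 欧阳小三)) {
    my ($family_name, $given_name) = split_name($full_name);
    print "$full_name: family name $family_name, given name $given_name\n";
}
```

Even with a dictionary the split stays a heuristic: a name that merely begins with the same two characters as a compound family name would be split the same way.
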
雨轻弹 2024-09-06 16:52:25

You'll need some kind of heuristic to separate the first and last names. Here's some working code that assumes that the last name (surname) is one character (the first) and all the remaining characters (at least one) belong to the first name (given name):

EDIT: Changed program to ignore invalid lines rather than dying.

use strict;
use utf8;

binmode STDOUT, ":utf8";

while (my $name = <DATA>) {
    my ($lname, $fname) = $name =~ /^(\p{Han})(\p{Han}+)$/ or next;
    print "First name: $fname\nLast name: $lname\n";
}

__DATA__  
张小三

When I run this program from the command line, I get this output:

First name: 小三
Last name: 张
他夏了夏天 2024-09-06 16:52:25

This splits the string into characters (extended grapheme clusters, matched by \X) and assigns the first two to $fname and $lname.

my ($fname, $lname) = $name =~ m/ ( \X ) /gx;

Though I think your example and your question don't really match (in your example, $fname has two characters).
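
To split a name into every single character rather than just the first two, the same \X match can be collected into an array. A minimal sketch, assuming the string is already decoded (here via use utf8 in a UTF-8 source file); the helper name split_chars is made up for illustration:

```perl
use strict;
use warnings;
use utf8;                              # the literal below is UTF-8 source text

binmode STDOUT, ':encoding(UTF-8)';

# Return every user-perceived character (extended grapheme cluster)
# of a decoded string, in order. Meant to be called in list context.
sub split_chars {
    my ($text) = @_;
    return $text =~ /\X/g;
}

my @chars = split_chars('张小三');
print join('/', @chars), "\n";         # prints 张/小/三
```
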
