How do I count the Chinese words in a file using regular expressions in Perl?
I tried the following Perl code to count the Chinese words in a file. It seems to run, but it does not give the right result. Any help is greatly appreciated.
The error message is:
    Use of uninitialized value $valid in concatenation (.) or string at word_counting.pl line 21, <FILE> line 21.
    Total things = 125, valid words =
This makes me think the problem is the file format. "Total things" is 125, which is the number of strings (125 lines). The strangest part is that my console displayed all the individual Chinese words correctly, without any problem. The utf8 pragma is enabled.
    #!/usr/bin/perl -w
    use strict;
    use utf8;
    use Encode qw(encode);
    use Encode::HanExtra;

    my $input_file = "sample_file.txt";
    my ($total, $valid);
    my %count;

    open (FILE, "< $input_file") or die "Can't open $input_file: $!";
    while (<FILE>) {
        foreach (split) {            # break $_ into words, assign each to $_ in turn
            $total++;
            next if /\W|^\d+/;       # strange words skip the remainder of the loop
            $valid++;
            $count{$_}++;            # count each separate word stored in a hash
            ## next comes here ##
        }
    }

    print "Total things = $total, valid words = $valid\n";
    foreach my $word (sort keys %count) {
        print "$word \t was seen \t $count{$word} \t times.\n";
    }
Data (sample_file.txt):
那天约二更时,只见封肃方回来,欢天喜地.众人忙问端的.他乃说道:"原来本府新升的太爷姓贾名化,本贯胡州人氏,曾与女婿旧日相交.方才在咱门前过去,因见娇杏那丫头买线, 所以他只当女婿移住于此.我一一将原故回明,那太爷倒伤感叹息了一回,又问外孙女儿,我说看灯丢了.太爷说:`不妨,我自使番役务必探访回来.'说了一回话, 临走倒送了我二两银子."甄家娘子听了,不免心中伤感.一宿无话.至次日, 早有雨村遣人送了两封银子,四匹锦缎,答谢甄家娘子,又寄一封密书与封肃,转托问甄家娘子要那娇杏作二房. 封肃喜的屁滚尿流,巴不得去奉承,便在女儿前一力撺掇成了,乘夜只用一乘小轿,便把娇杏送进去了.雨村欢喜,自不必说,乃封百金赠封肃, 外谢甄家娘子许多物事,令其好生养赡,以待寻访女儿下落.封肃回家无话.
Answers (2)
We set STDOUT to the :utf8 I/O layer so that say does not emit malformed data, then open the file with the same layer so that the diamond operator does not read malformed data.

Afterward, inside the while loop, rather than splitting on the empty string, we use a regex with the "East_Asian_Width: Wide" Unicode property.

The utf8 pragma is there for my own sanity checking and can be removed.

EDIT: J-16 SDiZ and daxim pointed out that the chances of sample_file.txt being in UTF-8 are slim. Read their comments, then take a look at the Encode module documentation in perldoc, specifically the "Encoding via PerlIO" section.
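
The approach described in this answer can be sketched roughly as follows. This is a minimal sketch, not the answerer's original code: the helper name count_wide_chars is hypothetical, the input is assumed to be UTF-8, and an in-memory file stands in for sample_file.txt so the example is self-contained. Note that it counts individual wide characters, not dictionary words.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # this source file contains UTF-8 literals
use Encode qw(encode);

binmode(STDOUT, ':encoding(UTF-8)');    # print characters, not raw bytes

# Hypothetical helper: count every East-Asian-wide character read from a
# filehandle that already decodes its input to characters.
sub count_wide_chars {
    my ($fh) = @_;
    my $total = 0;
    my %count;
    while (my $line = <$fh>) {
        for my $char ($line =~ /\p{East_Asian_Width=Wide}/g) {
            $total++;
            $count{$char}++;
        }
    }
    return ($total, \%count);
}

# Assumption: the data is UTF-8. The in-memory file needs bytes, so the
# literal is encoded first and decoded again by the :encoding layer.
my $bytes = encode('UTF-8', "封肃回家无话.封肃喜\n");
open(my $fh, '<:encoding(UTF-8)', \$bytes)
    or die "Can't open in-memory file: $!";
my ($total, $count) = count_wide_chars($fh);
close($fh);

print "Total wide characters = $total\n";
for my $char (sort keys %$count) {
    print "$char \t was seen \t $count->{$char} \t times.\n";
}
```

For a real file, the open call would take the file name instead of the scalar reference, for example open(my $fh, '<:encoding(UTF-8)', 'sample_file.txt'). If the file is in a legacy Chinese encoding instead, as the EDIT suggests, the layer name would have to change accordingly, for example :encoding(euc-cn), or :encoding(gb18030) with Encode::HanExtra loaded. Which one applies depends on the file and is not something this sketch can detect.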

I may be able to offer some insight, though it is hard to tell whether my answer will be helpful. First, I only speak and read English, so I do not speak or read Chinese. I do happen to be the author of RegexKitLite, an Objective-C wrapper around the ICU regex engine. This is obviously not Perl. :)

Despite this, the ICU regex engine has a feature that sounds remarkably like what you are trying to do. Specifically, it provides the UREGEX_UWORD modifier option, which can be turned on dynamically via the normal (?w:...) syntax. With this modifier set, \b finds word boundaries according to Unicode UAX #29 (Text Boundaries) instead of the simple word/non-word character classification.

You can use this in a regex like (?w:\b(.*?)\b) to "extract" words from a string. The ICU regex engine has a fairly powerful word-breaking engine that is specifically designed to find word breaks in written languages that, unlike English, do not have an explicit space character. Again, since I do not read or write these languages, my understanding is that they are written "itisroughlysomethinglikethis". The ICU word-breaking engine uses heuristics, and occasionally dictionaries, to find the word breaks. It is my understanding that Thai is a particularly difficult case. In fact, I use ฉันกินข้าว (Thai for "I eat rice", or so I was told) with the regex (?w)\b\s* to perform a split operation on the string and extract the words. Without (?w) you cannot split on word breaks. With (?w) the result is the words ฉัน, กิน, and ข้าว.

If the above sounds like the problem you are having, then this could be the reason. If so, I am not aware of any way to accomplish this in Perl, but I would not consider this an authoritative answer, since I use the ICU regex engine more often than the Perl one and am clearly not properly motivated to find a working Perl solution when I have already got one. :) Hope this helps.
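
For comparison, Perl 5.22 and later do provide a Unicode-aware word boundary, \b{wb}, which follows the UAX #29 rules. As far as I understand it, Perl applies the default rules without a dictionary, so it does not segment Chinese or Thai into real words the way the ICU break engine described above can; a run of Han characters is simply broken after every character. The sketch below illustrates that difference. The helper name split_on_word_breaks is hypothetical, and the input is assumed to be an already-decoded character string.

```perl
#!/usr/bin/perl
use 5.022;                              # \b{wb} needs Perl 5.22 or later
use strict;
use warnings;
use utf8;                               # this source file contains UTF-8 literals

binmode(STDOUT, ':encoding(UTF-8)');

# Hypothetical helper: cut a decoded string at Unicode word boundaries and
# keep only the pieces that contain at least one word character.
sub split_on_word_breaks {
    my ($text) = @_;
    my @pieces = $text =~ /(.+?)\b{wb}/gs;   # shortest run ending at a boundary
    return grep { /\w/ } @pieces;
}

# Space-separated text comes back as ordinary words.
my @english = split_on_word_breaks("封肃 went home.");
print join(' | ', @english), "\n";

# Without a dictionary, a run of Han characters is cut after each character,
# so this is a character list rather than a word list.
my @chinese = split_on_word_breaks("封肃回家无话");
print join(' | ', @chinese), "\n";
```

If true word segmentation is needed, a dictionary-based segmenter outside the regex engine would be required; this sketch only shows what the built-in boundary assertion gives.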