The post has been updated. If you have already read the posted question, please jump to the Solution part. Thanks!
Here's the minimized code to exhibit my problem:
The input data file for the test was saved with Windows' built-in Notepad in UTF-8 encoding.
It has the following three lines (the two fields on each line are separated by a tab):
abacus æbәkәs
abalone æbәlәuni
abandon әbændәn
The Perl script file was also saved with Windows' built-in Notepad in UTF-8 encoding.
It contains the following code:
#!perl -w
use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}";
print $out "$hash{abalone}";
print $out "$hash{abandon}";
In the output, the hash table seems to be okay:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
But it is actually not, because I only get two values instead of three:
æbәlәuni
әbændәn
Perl gives the following warning message:
Use of uninitialized value $hash{"abacus"} in string at C:\test2.pl line 11, <$in> line 3.
Where's the problem? Can someone kindly explain? Thanks.
The Solution
Millions of thanks to all of you guys :) Now finally the culprit is found and the problem becomes fixable :)
As @Sinan insightfully pointed out, I'm now 100% sure that the culprit behind the problem I described above is the three-byte BOM, which Notepad added to my data file when it was saved as UTF-8 and which Perl does not strip on its own. Because of it, the first key in the hash is not plain abacus, so looking up $hash{abacus} finds nothing. Although many suggested that I should use "<:utf8" and ">:utf8" to read and write the files, these UTF-8 layers by themselves do not solve the problem. Instead they may cause some other problems.
To really solve the problem, all I actually need is to add one line of code to force Perl to ignore the BOM:
#!perl -w
use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";
seek $in,3,0; # force Perl to ignore the BOM!
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};
Now, the output is exactly what I expected:
$VAR1 = {
'abalone' => 'æbәlәuni
',
'abandon' => 'әbændәn',
'abacus' => 'æbәkәs
'
};
æbәkәs
æbәlәuni
әbændәn
Please note that the script is saved in UTF-8 encoding, and the code does not have to include any :utf8 layers because the input file and the output file are both saved in UTF-8 encoding and the bytes are passed through unchanged.
Finally, thanks again to all of you. And thank you, @Sinan, for the insightful guidance. Without your help, I would have stayed in the dark for God knows how long.
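The seek above always skips three bytes, so it would eat the start of the first key if the file were ever saved without a BOM. A minimal sketch of a more defensive variant follows. It assumes a helper named open_skipping_bom, which is hypothetical and not part of the original script, and it uses a temporary demo file with placeholder values in place of hash_test.txt.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper, not part of the original post: open a file for
# reading and skip the 3-byte UTF-8 BOM (EF BB BF) only when it is present.
sub open_skipping_bom {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $got = read $fh, my $head, 3;
    die "Cannot read $path: $!" unless defined $got;
    # No BOM: rewind so the first three bytes are not lost.
    seek $fh, 0, 0 unless $head eq "\xEF\xBB\xBF";
    return $fh;
}

# Demo data standing in for hash_test.txt: a BOM, then tab-separated
# pairs (the values are placeholders, not the real transcriptions).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBFabacus\tfirst\n", "abalone\tsecond\n";
close $tmp;

my $in = open_skipping_bom($path);
# chomp each line so the values do not keep their trailing newline.
my %hash = map { chomp; split /\t/, $_, 2 } <$in>;
close $in;

print exists $hash{abacus} ? "abacus found\n" : "abacus missing\n";
```

The chomp is an addition of this sketch: without it, every value except the last keeps its trailing newline, as the Dumper output above shows.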
Note
To clarify a little more, if I use:
open my $in,'<:utf8',"./hash_test.txt";
open my $out,'>:utf8',"./hash_result.txt";
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};
The output is this:
$VAR1 = {
'abalone' => "\x{e6}b\x{4d9}l\x{4d9}uni
",
'abandon' => "\x{4d9}b\x{e6}nd\x{4d9}n",
"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"
};
æbәlәuni
әbændәn
And the warning message:
Use of uninitialized value in print at C:\hash_test.pl line 13, <$in> line 3.
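The output above suggests that the :utf8 layer did its job: the BOM was decoded into the single character U+FEFF, which then stayed glued to the first key. The \x{...} escapes are only how Data::Dumper displays decoded characters. A minimal sketch, assuming a temporary demo file in place of hash_test.txt and shortened sample values, that decodes the input and then removes a leading U+FEFF:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Demo file standing in for hash_test.txt, written as raw UTF-8 bytes
# with a leading BOM, the way Notepad saves it.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBFabacus\t\xC3\xA6b\n", "abandon\t\xD3\x99b\n";
close $tmp;

open my $in, '<:encoding(UTF-8)', $path or die "Cannot open $path: $!";
my @lines = <$in>;
close $in;

# After decoding, the BOM is the single character U+FEFF at the start of
# the first line; remove it if it is there.
$lines[0] =~ s/\A\x{FEFF}// if @lines;
chomp @lines;

my %hash = map { split /\t/, $_, 2 } @lines;

# Encode on output so the decoded characters are written as UTF-8 again.
binmode STDOUT, ':encoding(UTF-8)';
print "$_ => $hash{$_}\n" for sort keys %hash;
```

With this approach the keys and values are character strings rather than byte strings, so functions such as length count characters instead of bytes.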
Comments (5)
I find the warning message a little suspicious. It tells you that the $in filehandle is at line 3 when it should be at line 4 after having read the last line.
When I tried your code, I saved the input file using GVim, which is configured on my system to save as UTF-8, and I did not see the problem. Now that I have tried it with Notepad, looking at the output file, I see \x{feff}, which is the BOM.
In your Dumper output, there is a spurious blank before abacus (where you had not specified :utf8 for the output handle).
As I had mentioned originally (lost to the umpteen edits on this post; thanks for the reminder, hobbs), specify '<:utf8' when you are opening the input file.
If you want to read/write UTF8 files, you should make sure that you are actually reading them in as UTF8.
If you want it to be more robust, it is recommended to use :encoding(utf8) instead of :utf8 for reading a file. Read PerlIO for more information.
I think your answer may be sitting right in front of you. In the Data::Dumper output which you posted, notice the character between the ' and abacus? You tried to access the third value via $hash{abacus}. This is incorrect because of that character before abacus in the Dumper() hash. You could try plugging it into a loop, which should take care of it.
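A minimal sketch of one such loop, not necessarily the one the commenter had in mind: it prints every key together with the hex value of each of its characters, so that an invisible prefix shows up. The sample hash and the hex_dump helper are assumptions made for this demo.

```perl
use strict;
use warnings;

# Sample hash mimicking the problem: the first key carries the three
# UTF-8 BOM bytes in front of "abacus" (an assumption for this demo).
my %hash = (
    "\xEF\xBB\xBFabacus" => 'v1',
    'abalone'            => 'v2',
);

# Hypothetical helper: render every character of a string as hex so that
# invisible characters become visible.
sub hex_dump {
    my ($str) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $str;
}

for my $key (sort keys %hash) {
    # Replace anything outside printable ASCII with '?' for display only.
    (my $printable = $key) =~ s/[^\x20-\x7E]/?/g;
    print "$printable: ", hex_dump($key), "\n";
}
```

A key whose dump starts with EF BB BF (or with FEFF when the file was read through a decoding layer) carries a BOM and will not match the plain string abacus.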
split/\s/ instead of split/\t/
Works For Me. Are you sure your example matches your actual code and data?