
Perl Hash byte-order-mark

Why can't I use the map function to create a good hash from a simple data file in Perl?

Posted on 2024-08-12 00:06:27

The post is updated. Please kindly jump to the Solution part, if you've already read the posted question. Thanks!

Here's the minimized code to exhibit my problem:

The input data file for the test was saved with Windows' built-in Notepad in UTF-8 encoding.
It has the following three lines:

abacus  æbәkәs
abalone æbәlәuni
abandon әbændәn

The Perl script file was also saved with Windows' built-in Notepad in UTF-8 encoding.
It contains the following code:

#!perl -w

use Data::Dumper;
use strict;
use autodie;
open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";

my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}";
print $out "$hash{abalone}";
print $out "$hash{abandon}";

In the output, the hash table seems to be okay:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };

But it is actually not, because I only get two values instead of three:

æbәlәuni
әbændәn

Perl gives the following warning message:

Use of uninitialized value $hash{"abacus"} in string at C:\test2.pl line 11, <$in> line 3.

Where's the problem? Can someone kindly explain? Thanks.

The Solution

Millions of thanks to all of you guys :) Now finally the culprit is found and the problem becomes fixable :)
As @Sinan insightfully pointed out, I'm now 100% sure that the culprit is the three-byte UTF-8 BOM (EF BB BF), which Notepad added to my data file when saving it as UTF-8 and which Perl reads as ordinary data instead of skipping it. Although many suggested that I should use "<:utf8" and ">:utf8" to read and write the files, these layers alone do not solve the problem: the BOM is merely decoded to the character U+FEFF and still ends up at the start of the first key. Instead, they may cause some other problems.

To really solve the problem, all I actually need is to add one line of code to force Perl to ignore the BOM:

#!perl -w

use Data::Dumper;
use strict;
use autodie;

open my $in,'<',"./hash_test.txt";
open my $out,'>',"./hash_result.txt";

seek $in,3,0; # force Perl to ignore the BOM!
my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};

Now, the output is exactly what I expected:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };
æbәkәs
æbәlәuni
әbændәn

Please note that the script is saved in UTF-8 and the code does not need any UTF-8 I/O layers, because the input file is already UTF-8 encoded and its bytes are written to the output file unchanged.
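
One caveat with the unconditional seek: on a file saved without a BOM it would silently drop the first three bytes of real data. A minimal sketch of a guarded variant follows, assuming a byte-oriented handle as in the script above. The helper name skip_utf8_bom is hypothetical, and the demo reads in-memory data with made-up values so that it runs without hash_test.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: leave a byte-oriented handle positioned just past a
# UTF-8 BOM (EF BB BF), or back at the start when the file has no BOM.
sub skip_utf8_bom {
    my ($fh) = @_;
    my $n = read($fh, my $head, 3);
    my $has_bom = defined $n && $n == 3 && $head eq "\xEF\xBB\xBF";
    seek($fh, 0, 0) unless $has_bom;    # rewind: those bytes were real data
    return $has_bom;
}

# Demo input (illustrative values, not the original data file).
my $data = "\xEF\xBB\xBFabacus\tone\nabalone\ttwo\nabandon\tthree\n";
open my $in, '<', \$data or die "open: $!";

skip_utf8_bom($in);
my %hash = map { chomp; split /\t/, $_, 2 } <$in>;
print "$_ => $hash{$_}\n" for sort keys %hash;
```

The same call can be dropped in place of the seek line in the script above; for a file without a BOM it rewinds and nothing is lost.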

Finally, thanks again to all of you. And thank you, @Sinan, for the insightful guidance. Without your help, I would have stayed in the dark for God knows how long.

Note
To clarify a little more, if I use:

open my $in,'<:utf8',"./hash_test.txt";
open my $out,'>:utf8',"./hash_result.txt";

my %hash = map {split/\t/,$_,2} <$in>;
print $out Dumper(\%hash);
print $out $hash{abacus};
print $out $hash{abalone};
print $out $hash{abandon};

The output is this:

$VAR1 = {
          'abalone' => "\x{e6}b\x{4d9}l\x{4d9}uni
",
          'abandon' => "\x{4d9}b\x{e6}nd\x{4d9}n",
          "\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"
        };
æbәlәuni
әbændәn

And the warning message:

Use of uninitialized value in print at C:\hash_test.pl line 13, <$in> line 3.
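
The dump above shows why the decoding layer alone is not enough: the BOM arrives as the character U+FEFF glued to the first key. A minimal sketch that keeps the decoding layer and strips that one character from the first line is shown below. It assumes the whole file fits in memory and uses in-memory sample bytes instead of hash_test.txt; it is one possible approach, not the fix the post settled on.

```perl
use strict;
use warnings;

# Sample bytes standing in for hash_test.txt: UTF-8 BOM, then tab-separated pairs.
my $bytes = "\xEF\xBB\xBFabacus\t\xC3\xA6b\xD3\x99k\xD3\x99s\n"
          . "abalone\t\xC3\xA6b\xD3\x99l\xD3\x99uni\n";

open my $in, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my @lines = <$in>;
close $in;

# After decoding, the BOM is the single character U+FEFF at the start of line 1.
$lines[0] =~ s/\A\x{FEFF}// if @lines;
chomp @lines;

my %hash = map { split /\t/, $_, 2 } @lines;

binmode STDOUT, ':encoding(UTF-8)';
print "$_ => $hash{$_}\n" for sort keys %hash;
```

Because the lines are chomped before splitting, the values also lose the trailing newlines visible in the Dumper output.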

带刺的爱情 2024-08-19 00:06:27

I find the warning message a little suspicious. It tells you that the $in filehandle is at line 3 when it should be at line 4 after having read the last line.

When I first tried your code, I saved the input file with GVim, which is configured on my system to save as UTF-8, and I did not see the problem. Now that I have tried it with Notepad and looked at the output file, I see:

"\x{feff}abacus" => "\x{e6}b\x{4d9}k\x{4d9}s
"

where \x{feff} is the BOM.

In your Dumper output, there is a spurious blank before abacus (where you had not specified :utf8 for the output handle).

As I had mentioned originally (lost in the umpteen edits to this post; thanks for the reminder, hobbs), specify '<:utf8' when you open the input file.

盗心人 2024-08-19 00:06:27

If you want to read/write UTF8 files, you should make sure that you are actually reading them in as UTF8.

#! /usr/bin/env perl
use Data::Dumper;
open my $in,  '<:utf8', "hash_test.txt";
open my $out, '>:utf8', "hash_result.txt";

my %hash = map { chomp; split ' ', $_, 2 } <$in>;
print $out Dumper(\%hash),"\n";
print $out "$hash{abacus}\n";
print $out "$hash{abalone}\n";
print $out "$hash{abandon}\n";

If you want it to be more robust, it is recommended to use :encoding(utf8) instead of :utf8, for reading a file.

open my $in, '<:encoding(utf8)', "hash_test.txt";

Read PerlIO for more information.
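
One reason the strict encoding is usually considered more robust is that it validates the input instead of trusting it. The sketch below makes that validation visible with Encode's decode and the FB_CROAK check rather than a PerlIO layer, so it is a swapped-in illustration of strict decoding, not the layer itself. The helper name decode_strict and the sample bytes are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode strictly, returning undef on malformed UTF-8.
sub decode_strict {
    my ($bytes) = @_;    # work on a copy: decode may modify its input when a check is given
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return $text;
}

my $good = decode_strict("abacus\t\xC3\xA6b\n");   # valid UTF-8 for U+00E6
my $bad  = decode_strict("abacus\t\xE6b\n");       # lone Latin-1 byte, not valid UTF-8

print defined $good ? "good line decoded\n" : "good line rejected\n";
print defined $bad  ? "bad line decoded\n"  : "bad line rejected\n";
```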

只涨不跌 2024-08-19 00:06:27

I think your answer may be sitting right in front of you. The output from Data::Dumper which you posted is:

$VAR1 = {
          'abalone' => 'æbәlәuni
',
          'abandon' => 'әbændәn',
          'abacus' => 'æbәkәs
'
        };

Notice the character between the ' and abacus? You tried to access the third value via $hash{abacus}. This is incorrect because of that character before abacus in the Dumper() hash. You could try plugging it into a loop which should take care of it:

foreach my $k (keys %hash) {
  print $out $hash{$k};
}
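
To make such an invisible character show up when inspecting keys, a small escaping helper can be used. This is a minimal sketch: the helper name visible is hypothetical, and the hash is built by hand here to mimic what the BOM does to the first key when the file is read as raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: show every byte outside printable ASCII as \x{..}.
sub visible {
    my ($s) = @_;
    return join '',
        map { /[\x20-\x7e]/ ? $_ : sprintf('\\x{%02x}', ord) } split //, $s;
}

# Hand-built stand-in for the hash read from the Notepad-saved file.
my %hash = ("\xEF\xBB\xBFabacus" => 'first', 'abalone' => 'second');

for my $k (sort keys %hash) {
    printf "%-28s => %s\n", visible($k), $hash{$k};
}
```

With the escaped form printed, the first key is clearly not the plain string abacus, which is why $hash{abacus} comes back undefined.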

蝶…霜飞 2024-08-19 00:06:27

Use split/\s/ instead of split/\t/.

°如果伤别离去 2024-08-19 00:06:27

Works for me. Are you sure your example matches your actual code and data?
