Perl string internals

Posted on 2024-09-03 20:02:40

How are Perl strings represented internally? What encoding is used? How do I handle different encodings properly?

I've been using Perl for quite a long time, but that work involved little string handling across different encodings, and whenever I hit a minor encoding problem I usually resorted to some shamanic rituals.

Until now I thought of Perl strings as sequences of bytes, which fit my tasks pretty well. Now I need to process a UTF-8 encoded file, and this is where the trouble starts.

First, I read the file into a string like this:

open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading";
binmode($in, ':utf8');

my $contents;

{
    local $/;
    $contents = <$in>;
}

close($in);

Then I simply print it:

print $contents;

I get two things: the warning Wide character in print at <scriptname> line <n>, and garbage in the console. So I can conclude that Perl strings have a concept of a "character" that can be "wide" or not, but when printed these "wide" characters show up in the console as multiple bytes, not as a single character. (I now wonder why all my previous experience with binary files worked just as I expected, without any "character" issues.)

Why, then, do I see garbage in the console? If Perl stores strings as characters in some known encoding, it shouldn't be a big problem to find out the console encoding and print the text properly. (I use Windows, BTW.)

If Perl stores strings as variable-width character sequences (e.g. using the same UTF-8 encoding), why is it done this way? From my C experience, handling such strings is a PAIN.

Update.

I use two computers for testing: one runs Windows 7 x64 with the English language pack installed but with Russian regional settings (so cp866 is the OEM code page and cp1251 the ANSI one) and ActivePerl 5.10.1 x64; the other runs a Russian-localized Windows XP 32-bit with Cygwin Perl 5.10.0.

Thanks to the links, I now have a much more solid understanding of what's going on and how things should be done.


不及他 2024-09-10 20:02:40

Setting the :utf8 layer before reading from the file is good: it automagically decodes the bytes into Perl's internal encoding. (That is also UTF-8, but you don't need to know that and shouldn't rely on it.)

Before printing, you need to encode the characters back to bytes.

use Encode;
utf8::encode($contents);  # in place: characters become UTF-8 bytes

The Encode module also has a two-argument form, encode(ENCODING, STRING), for encodings other than UTF-8.
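As a rough illustration of that two-argument form, the sketch below decodes some UTF-8 bytes into characters and then encodes them for two different targets. The sample word and the variable names are made up for the example; cp866 is chosen only because the question mentions it as the OEM code page.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a UTF-8 file: the Russian word for "world".
my $utf8_bytes = "\xD0\xBC\xD0\xB8\xD1\x80";

# Decode the bytes into a character string (three characters).
my $chars = decode('UTF-8', $utf8_bytes);

# Encode the same characters for two different targets.
my $cp866_bytes = encode('cp866', $chars);   # one byte per character
my $utf8_again  = encode('UTF-8', $chars);   # two bytes per character here

printf "%d characters, %d cp866 bytes, %d UTF-8 bytes\n",
    length($chars), length($cp866_bytes), length($utf8_again);
```

Unlike utf8::encode, which rewrites its argument in place, Encode's encode returns the bytes as a new string, so the character string stays usable afterwards.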

Here is a good reference. (There would have been more links, but this is my first post.) Check out perlunitut too, and the Unicode article on Joel on Software.

http://www.ahinea.com/en/tech/perl-unicode-struggle.html

Oh, and it has to use multi-byte strings, because otherwise it just wouldn't be Unicode.


清风疏影 2024-09-10 20:02:40

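Perl strings are stored internally in one of two encodings: the native, 8-bit byte-oriented encoding or UTF-8. For backward compatibility, all I/O and strings are assumed to be in the native encoding unless otherwise specified. The native encoding is usually an 8-bit, ASCII-compatible encoding, but this can be changed with use locale.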
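The two storage forms can be observed, though normally there is no need to, with the always-available utf8::is_utf8 and utf8::upgrade functions. The following is a minimal sketch with an arbitrary sample string; it only shows that the same characters can sit in either representation and still compare equal.

```perl
use strict;
use warnings;

my $native   = "caf\xE9";    # four characters, stored as four native bytes
my $upgraded = $native;
utf8::upgrade($upgraded);    # same four characters, now stored as UTF-8 internally

printf "internal UTF-8 flag: %d vs %d\n",
    (utf8::is_utf8($native)   ? 1 : 0),
    (utf8::is_utf8($upgraded) ? 1 : 0);

# At the Perl level the two strings are indistinguishable.
printf "equal: %s, lengths: %d and %d\n",
    ($native eq $upgraded ? 'yes' : 'no'),
    length($native), length($upgraded);
```

The flag describes storage only. It says nothing about whether a string holds decoded text or raw bytes, which is why code should not branch on it.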

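In your example you call binmode on the input handle, switching it to :utf8 semantics. One effect of this is that every string read from this handle will be encoded as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting characters in the native encoding.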

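Perl, trying to do the right thing, will let you send a UTF-8 string to a native-encoded output, but with no encoding attached to that handle it has to guess how to output multi-byte characters, and it will almost certainly guess wrong. That is what the warning means: a multi-byte character was sent to a stream that expects only single-byte characters, so the character will probably be damaged in translation.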
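The warning can be reproduced without a console. In this sketch an in-memory filehandle with no encoding layer stands in for STDOUT (an assumption made so the result can be inspected), and a __WARN__ handler collects what Perl reports.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

# Three characters above 0xFF: the Russian word for "world".
my $chars = "\x{43C}\x{438}\x{440}";

# In-memory handle with no encoding layer, standing in for the console.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory handle: $!";
print {$out} $chars;
close($out);

printf "warnings: %d, bytes written: %d\n", scalar(@warnings), length($buffer);
print $warnings[0] if @warnings;
```

The handle ends up with more bytes than there were characters, which is the mismatch a cp866 console then renders as garbage.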

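Depending on what you want to accomplish, you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single-byte character set that can be printed safely, or, if you know that whatever is attached to STDOUT can handle UTF-8, you can use binmode(STDOUT, ':utf8'); to tell Perl that any data sent to STDOUT should be sent as UTF-8.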
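To see what attaching an encoding to the output handle changes, here is a small sketch that pushes the same characters through two different layers. The helper bytes_through_layer is hypothetical and writes to an in-memory handle so the bytes can be inspected; on a real console the layer would be applied to STDOUT instead, and cp866 is assumed only because the question names it as the OEM code page.

```perl
use strict;
use warnings;

# Three characters: the Russian word for "world".
my $chars = "\x{43C}\x{438}\x{440}";

# Hypothetical helper: write $text through the given PerlIO layer
# into an in-memory handle and return the raw bytes that come out.
sub bytes_through_layer {
    my ($layer, $text) = @_;
    my $buffer = '';
    open(my $out, ">$layer", \$buffer) or die "cannot open in-memory handle: $!";
    print {$out} $text;
    close($out) or die "cannot close handle: $!";
    return $buffer;
}

my $as_utf8  = bytes_through_layer(':encoding(UTF-8)', $chars);
my $as_cp866 = bytes_through_layer(':encoding(cp866)', $chars);

printf "UTF-8 layer: %d bytes, cp866 layer: %d bytes\n",
    length($as_utf8), length($as_cp866);

# For a real console, the equivalent setup would be something like:
#   binmode(STDOUT, ':encoding(cp866)');
```

With a layer in place, print receives characters and the handle does the encoding, so no Wide character warning is raised for characters the target encoding can represent.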