Perl string internals

Posted on 2024-09-03 20:02:40

How are Perl strings represented internally? What encoding is used? How do I handle different encodings properly?

I've been using Perl for quite a long time, but that work didn't involve much string handling in different encodings, and whenever I hit a minor encoding-related problem I usually resorted to some shamanic actions.

Until now I thought of Perl strings as sequences of bytes, which fit my tasks pretty well. Now I need to do some processing of a UTF-8 encoded file, and this is where the trouble starts.

First, I read the file into a string like this:

open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading";
binmode($in, ':utf8');

my $contents;

{
    local $/;
    $contents = <$in>;
}

close($in);

then simply print it:

print $contents;

And I get two things: a warning, Wide character in print at <scriptname> line <n>, and garbage in the console. So I can conclude that Perl strings have a concept of a "character" that can be "wide" or not, but when printed these "wide" characters show up in the console as multiple bytes, not as a single "character".
(I now wonder why all my previous experience with binary files worked just as I expected, without any "character" issues.)
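The distinction between bytes and characters can be seen without any file or console at all. The following is only a sketch: the sample word "мир" and the variable names are mine, not part of the question. A "wide" character in the sense of the warning is one whose code point is above 255, so it cannot be written out as a single byte.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for the Russian word "мир" (3 letters, 2 bytes each).
my $bytes = "\xD0\xBC\xD0\xB8\xD1\x80";

# Decoding turns the byte string into a character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Characters with a code point above 255 are the "wide" ones the warning is about.
my @wide = grep { ord($_) > 255 } split //, $chars;
printf "wide characters: %d\n", scalar @wide;
```

Binary data never contains such characters as long as it is never decoded, which would explain why byte-oriented code keeps working without surprises.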

Why, then, do I see garbage in the console? If Perl stores strings as characters in some known encoding, I don't think it should be a big problem to find out the console encoding and print the text properly. (I use Windows, BTW.)

If Perl stores strings as variable-width character sequences (e.g. using the same UTF-8 encoding), why is it done this way? From my C experience, handling such strings is PAIN.

Update.

I use two computers for testing. One runs Windows 7 x64 with the English language pack installed but with Russian regional settings (so I have cp866 as the OEM code page and cp1251 as the ANSI one) and ActivePerl 5.10.1 x64; the other runs a Russian-localized 32-bit Windows XP with Cygwin Perl 5.10.0.

Thanks to the links, I now have a much more solid understanding of what's going on and how things should be done.

Comments (3)

不及他 2024-09-10 20:02:40

Setting utf8 on the handle before reading from the file is good: it automagically decodes the bytes into the internal encoding. (Which is also UTF-8, but you don't need to know that and shouldn't rely on it.)

Before printing you need to encode the characters back to bytes.

use Encode;
utf8::encode($contents);  # in place: converts characters to UTF-8 bytes

There is also a two-argument form of encode, for encodings other than UTF-8. (That sentence echoes too much, doesn't it?)
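As a rough sketch of that two-argument form, here is Encode::encode producing bytes for two different targets. The sample string and the hex_dump helper are made up for illustration; cp866 is chosen only because the question names it as the OEM code page.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $contents = "\x{43C}\x{438}\x{440}";   # the character string "мир"

# encode(ENCODING, STRING) returns a new byte string and leaves $contents alone.
my $utf8_bytes  = encode('UTF-8', $contents);
my $cp866_bytes = encode('cp866', $contents);

print 'UTF-8: ', hex_dump($utf8_bytes),  "\n";
print 'cp866: ', hex_dump($cp866_bytes), "\n";
```

The same three characters come out as six bytes in UTF-8 and three bytes in cp866, which is why the target encoding has to match whatever the console expects.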

Here is a good reference. (There would have been more, but it's my first post.) Check out perlunitut too, and the Unicode article on Joel on Software.

http://www.ahinea.com/en/tech/perl-unicode-struggle.html

Oh, and it has to use multi-byte strings, because one byte per character simply cannot cover Unicode.

清风疏影 2024-09-10 20:02:40

Perl strings are stored internally in one of two encodings: either an 8-bit, byte-oriented native encoding or UTF-8. For backwards compatibility, the assumption is that all I/O and strings are in the native encoding unless otherwise specified. The native encoding is usually ISO-8859-1 (Latin-1), but this can be changed with use locale.
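The two internal representations can be poked at with the built-in utf8:: functions. This is a sketch for inspection only (the variable names are mine); ordinary code should not need to care which form a string is in, since the characters are the same either way.

```perl
use strict;
use warnings;

my $native   = "caf\xE9";   # "café", with é as code point U+00E9
my $upgraded = $native;

# Switch the internal storage of the copy to UTF-8; the characters do not change.
utf8::upgrade($upgraded);

printf "internal UTF-8 flag: %d vs %d\n",
    (utf8::is_utf8($native) ? 1 : 0), (utf8::is_utf8($upgraded) ? 1 : 0);
printf "equal: %s, length: %d vs %d\n",
    ($native eq $upgraded ? 'yes' : 'no'), length($native), length($upgraded);
```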

In your sample you call binmode on your input handle, changing it to use :utf8 semantics. One effect of this is that all strings read from this handle are decoded from UTF-8 into character strings, which Perl stores internally as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting native-encoded characters.
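The effect of a decoding layer on an input handle can be sketched with an in-memory file standing in for the real one. The data and names below are placeholders of mine. I use :encoding(UTF-8) here rather than :utf8 because, as far as I know, it also validates the input; treat that choice as a suggestion, not as what the question's code does.

```perl
use strict;
use warnings;

# Stand-in for the UTF-8 file from the question: raw bytes for "мир\n".
my $file_bytes = "\xD0\xBC\xD0\xB8\xD1\x80\n";

# The layer decodes bytes into characters as they are read.
open(my $in, '<:encoding(UTF-8)', \$file_bytes)
    or die "cannot open in-memory file: $!";

my $contents = do { local $/; <$in> };   # slurp the whole "file"
close($in);

printf "bytes in file: %d, characters read: %d\n",
    length($file_bytes), length($contents);
```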

Perl, in an attempt to do the right thing, will allow a UTF-8 string to be sent to a native-encoded output, but if there is no encoding attached to that handle it has to guess how to output characters that do not fit in a single byte, and it will almost certainly guess wrong: it writes their UTF-8 bytes, which a console expecting a single-byte code page shows as garbage. That is what the warning means: a wide character was sent to a stream that only expects single-byte characters, and as a result the character was probably damaged in translation.

Depending on what you want to accomplish, you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single-byte character set that can be printed safely, or, if you know that whatever is attached to STDOUT can handle UTF-8, you can use binmode(STDOUT, ':utf8'); to tell Perl that any data sent to STDOUT should be sent as UTF-8.
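Here is a small sketch of attaching an encoding to an output handle. To keep the result inspectable it writes to an in-memory buffer instead of STDOUT; the same binmode call on STDOUT would be the real-world version. The cp866 layer is an assumption based on the OEM code page named in the question, so substitute whatever your console actually uses.

```perl
use strict;
use warnings;

my $contents = "\x{43C}\x{438}\x{440}\n";   # the character string "мир\n"

# In-memory handle standing in for STDOUT so the emitted bytes can be checked.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory handle: $!";

# Assumed console encoding; for a real console: binmode(STDOUT, ':encoding(cp866)')
binmode($out, ':encoding(cp866)') or die "binmode failed: $!";

print {$out} $contents;   # characters are encoded on the way out, no warning
close($out);

printf "wrote %d bytes for %d characters\n", length($buffer), length($contents);
```

With the layer in place the handle does the encoding itself, so the rest of the program can keep working with characters only.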

守不住的情 2024-09-10 20:02:40

You should mention your actual Windows and Perl versions, as this really depends on the versions you use and the language packs installed.
Otherwise, have a look at the perlunicode manual page first:

Perl uses logically-wide characters to represent strings internally.

It will confirm your statements.

Windows does not fully install all UTF-8 characters, so this might be the reason for your issue. You may need to install an additional language pack.
