How do I decode UTF-16 data in Perl when I don't know the byte order?

Posted on 2024-09-02 07:56:23

If I open a file (and specify an encoding directly):

open(my $file,"<:encoding(UTF-16)","some.file") || die "error $!\n";
while(<$file>) {
    print "$_\n";
}
close($file);

I can read the file contents nicely. However, if I do:

use Encode;

open(my $file,"some.file") || die "error $!\n";
while(<$file>) {
    print decode("UTF-16",$_);
}
close($file);

I get the following error:

UTF-16:Unrecognised BOM d at F:/Perl/lib/Encode.pm line 174

How can I make it work with decode?

EDIT: here are the first several bytes:

FF FE 3C 00 68 00 74 00

羁〃客ぐ 2024-09-09 07:56:23

If you simply specify "UTF-16", Perl is going to look for the byte-order mark (BOM) to figure out how to parse it. If there is no BOM, it's going to blow up. In that case, you have to tell Encode which byte-order you have by specifying either "UTF-16LE" for little-endian or "UTF-16BE" for big-endian.
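As a rough illustration of the difference between "UTF-16" and an explicit byte order, here is a minimal sketch using the bytes quoted in the question. The variable names are made up for the example. What plain "UTF-16" does when there is no BOM appears to depend on the Encode version (it may die or fall back to a default byte order), so the sketch traps that call rather than assuming an outcome.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The first bytes quoted in the question: FF FE is a little-endian BOM.
my $with_bom = "\xFF\xFE<\x00h\x00t\x00";
# The same text without the BOM (hypothetical input for comparison).
my $no_bom   = "<\x00h\x00t\x00";

# "UTF-16" reads the BOM, picks the byte order from it, and drops the BOM.
my $from_bom = decode("UTF-16", $with_bom);

# With no BOM, name the byte order explicitly.
my $explicit = decode("UTF-16LE", $no_bom);

# Plain "UTF-16" on BOM-less data: trap the call instead of relying on it.
my $guess = eval { decode("UTF-16", $no_bom) };
print defined $guess
    ? "decoded without a BOM, byte order was guessed\n"
    : "decode failed: $@";
```

Both $from_bom and $explicit end up holding the three characters "<ht".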

There's something else going on in your situation, though, and it's hard to tell what without seeing the data you have in the file. I get the same error with both snippets: if I don't have a BOM and I don't specify a byte order, my Perl complains either way. Which Perl are you using, and on which platform? Does your platform's native endianness match that of your file? I think the behaviour I see is correct according to the docs.

Also, you can't simply read a line in some unknown encoding (whatever Perl's default is) then ship that off to decode. You might end up in the middle of a multi-byte sequence. You have to use Encode::FB_QUIET to save the part of the buffer that you couldn't decode and add that to the next chunk of data:

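use Encode qw(decode);

open my $lefh, '<:raw', 'text-utf16.txt' or die "error $!\n";

my $string = '';
while ( defined( my $chunk = <$lefh> ) ) {
    $string .= $chunk;    # append to whatever was left undecoded last time
    # FB_QUIET makes decode leave any trailing partial character in $string
    print decode("UTF-16LE", $string, Encode::FB_QUIET);
}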

故人的歌 2024-09-09 07:56:23

What you're trying to do is impossible.

You're reading lines of text without specifying an encoding, so every byte that holds a newline character (\x0a by default) ends a line. But that byte may very well be in the middle of a UTF-16 character, in which case your next line can't be decoded.
If your data is UTF-16LE, this will happen all the time: line feeds are \x0a \x00, so the \x00 ends up at the start of the next line. If you have UTF-16BE, you might get lucky (line feeds are \x00 \x0a) until you hit a character with \x0a in its high byte.
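To make the misalignment concrete, here is a small sketch that encodes a two-line string as UTF-16LE and then cuts it after every 0x0A byte, which is roughly what a byte-oriented readline does. The sample text and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "ab\ncd\n" as UTF-16LE: 61 00 62 00 0a 00 63 00 64 00 0a 00
my $bytes = encode("UTF-16LE", "ab\ncd\n");

# Cut after each 0x0A byte, the way a byte-oriented line read would.
my @chunks = split /(?<=\x0a)/, $bytes;

# Dump each chunk as hex pairs so the stray 00 bytes are visible.
print join(' ', unpack('(H2)*', $_)), "\n" for @chunks;
```

The first chunk has an odd number of bytes, and every later chunk starts with the 00 that belonged to the preceding line feed, so none of them after the first is aligned on a 16-bit boundary.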

So don't do that; open the file with the right encoding.
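If decode has to be used rather than an I/O layer, one option that avoids the line-splitting problem is to read the whole file as raw bytes and decode it in a single call. This is only a sketch: the temporary file and its contents are made up so the example is self-contained, and it assumes the file is small enough to hold in memory.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use File::Temp qw(tempfile);

# Hypothetical sample file: a little-endian BOM (FF FE) followed by UTF-16LE text.
my ($tmp, $path) = tempfile();
binmode $tmp;
print {$tmp} "\xFF\xFE" . encode("UTF-16LE", "<html>\nline two\n");
close $tmp;

# Read every byte untouched, then decode the whole buffer at once.
open my $fh, '<:raw', $path or die "error $!\n";
my $bytes = do { local $/; <$fh> };
close $fh;

my $text  = decode("UTF-16", $bytes);   # the BOM selects the byte order and is dropped
my @lines = split /^/, $text;           # splitting into lines is safe after decoding
print scalar(@lines), " lines decoded\n";
```

Because the BOM is at the start of the buffer, plain "UTF-16" is enough here; for data without a BOM, "UTF-16LE" or "UTF-16BE" would have to be named instead.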
