Perl XML::Parser encoding problem

Posted on 2024-10-25 01:42:13

I am writing a Perl script that needs to extract some data from an XML file.

The XML file itself is encoded using UTF-8. For some reason, however, what I extract from the file ends up being encoded as ISO-8859-1. The documentation states that whatever is passed to my handlers should be UTF-8, but it just isn't.

The parser is basically something like this:

my $parser = XML::Parser->new( Handlers => {
    # Some unrelated handlers here
    Char => sub {
        my ( $expat, $string ) = @_;
        if ( exists $data->{$curId}{$curField} ) {
            $data->{$curId}{$curField} .= $string;
        } else {
            $data->{$curId}{$curField} = $string;
        }
    } ,
} );

I have tried the following variants for actually parsing:

  • file parsed directly through $parser->parsefile, no options;
  • file parsed directly through $parser->parsefile, with the ProtocolEncoding option;
  • file opened using open( $handle , "<file.xml" ) then parsed through $parser->parse;
  • file opened using open( $handle , '<:utf8' , "file.xml" ) then parsed through $parser->parse.

In addition, I have tried each version with and without the <?xml encoding="utf-8"?> header in the file.

In all cases, what ends up in $data->{$curId}{$curField} is encoded using ISO-8859-1.

What am I doing wrong?

Comments (2)

挽袖吟 2024-11-01 01:42:13

I know you already found an answer from Michel in the comments, but I'll add a few things. With any encoding, you have to be strict about knowing what you're taking in and what you are sending out. If you need something, don't rely on the environment; eventually someone else will use your program and have a screwed-up environment.

When you are reading a file, don't use the ':utf8' layer. That layer doesn't check whether the octets are actually valid UTF-8. Use the strict ':encoding(UTF-8)' layer instead:

 open my $fh, '<:encoding(UTF-8)', $filename or ...;
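
To illustrate what "strict" buys you, here is a minimal sketch using only the core Encode module. The helper name decode_strict is hypothetical, and the sample strings are made up for the demonstration. It shows valid UTF-8 octets being decoded and ISO-8859-1 octets being rejected rather than silently accepted:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode octets as strict UTF-8, dying on malformed input.
sub decode_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die instead of substituting characters.
    # LEAVE_SRC keeps decode() from modifying the source string.
    return decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
}

my $good = "caf\xC3\xA9";   # "cafe" with e-acute, as UTF-8 octets
my $bad  = "caf\xE9";       # the same word as ISO-8859-1 octets, not valid UTF-8

my $chars = decode_strict($good);
printf "decoded %d characters\n", length $chars;

my $ok = eval { decode_strict($bad); 1 };
print "malformed input rejected\n" unless $ok;
```

Failing loudly on bad input is usually easier to debug than text that is quietly mangled further down the pipeline.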

No matter what you think your output handle is, set it explicitly. There are a variety of ways to do this:

 use open ':encoding(utf8)';

From the command-line, you can use the -C switch with the S flag to make the standard handles UTF-8:

 perl -CS script.pl input.xml
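
As a rough sketch of why the output handle matters, the following writes the same character string to two in-memory handles, one with no layer and one with an explicit UTF-8 layer, and dumps the resulting octets. The variable names and the sample text are assumptions made for the demonstration. A character such as U+00E9 comes out as a single ISO-8859-1 byte when no layer is set, which matches the symptom described in the question:

```perl
use strict;
use warnings;

my $text = "caf\x{E9}";   # character string: c, a, f, U+00E9

# Print through a handle with no encoding layer.
my $raw = '';
open my $plain, '>', \$raw or die "open failed: $!";
print {$plain} $text;
close $plain;

# Print through a handle with an explicit UTF-8 layer.
my $encoded = '';
open my $enc, '>:encoding(UTF-8)', \$encoded or die "open failed: $!";
print {$enc} $text;
close $enc;

# Show the octets that each handle actually wrote.
printf "no layer:    %s\n", join ' ', map { sprintf '%02X', ord } split //, $raw;
printf "UTF-8 layer: %s\n", join ' ', map { sprintf '%02X', ord } split //, $encoded;
# no layer:    63 61 66 E9
# UTF-8 layer: 63 61 66 C3 A9
```

For the standard handles, binmode(STDOUT, ':encoding(UTF-8)') has the same effect as the layer in the second open call.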

Tom Christiansen has a long list of things you need to pay attention to.

舟遥客 2024-11-01 01:42:13

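Does $data->{$curId}{$curField} have the utf8 flag on?

If you concatenate a string that has the utf8 flag on with a string that has it off, Perl upgrades the latter, treating each of its bytes as an ISO-8859-1 character. If those bytes were really UTF-8 octets, the result is mangled text. This is a common source of problems.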
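
A small sketch of that effect, using only core Perl and the Encode module. The sample strings are invented for the demonstration. It concatenates a flagged character string first with undecoded UTF-8 octets and then with the properly decoded text, and prints the code points of each result:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $wide   = "\x{263A}";                 # character string, utf8 flag on
my $octets = "\xC3\xA9";                 # UTF-8 octets for U+00E9, utf8 flag off
my $chars  = decode('UTF-8', $octets);   # decoded: the single character U+00E9

# Concatenating with raw octets: each byte is upgraded as a Latin-1 character.
my $mixed = $wide . $octets;
# Concatenating with decoded text: the character survives intact.
my $clean = $wide . $chars;

printf "flag on wide: %s\n", utf8::is_utf8($wide) ? 'yes' : 'no';
printf "mixed: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $mixed;
printf "clean: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $clean;
# mixed: U+263A U+00C3 U+00A9
# clean: U+263A U+00E9
```

Decoding every input at the boundary, so that only character strings are ever concatenated, avoids the mixed case entirely.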
