I am writing a Perl script that needs to extract some data from an XML file.
The XML file itself is encoded using UTF-8. For some reason, however, what I extract from the file ends up being encoded as ISO-8859-1. The documentation states that whatever is passed to my handlers should be UTF-8, but it just isn't.
The parser is basically something like this:
    my $parser = XML::Parser->new( Handlers => {
        # Some unrelated handlers here
        Char => sub {
            my ( $expat, $string ) = @_;
            if ( exists $data->{$curId}{$curField} ) {
                $data->{$curId}{$curField} .= $string;
            } else {
                $data->{$curId}{$curField} = $string;
            }
        },
    } );
I have tried the following variants for actually parsing:
- file parsed directly through $parser->parsefile, no options;
- file parsed directly through $parser->parsefile, with the ProtocolEncoding option;
- file opened using open( $handle , "<file.xml" ) then parsed through $parser->parse;
- file opened using open( $handle , '<:utf8' , "file.xml" ) then parsed through $parser->parse.
In addition, I have tried each version with and without the <?xml encoding="utf-8"?> header in the file.
In all cases, what ends up in $data->{$curId}{$curField} is encoded using ISO-8859-1.
What am I doing wrong?
Comments (2)
I know you already found an answer from Michel in the comments, but I'll add a few things. With any encoding, you have to be strict about knowing what you're taking in and what you are sending out. If you need something, don't rely on the environment; eventually someone else will use your program and have a screwed-up environment.
When you are reading a file, don't use the ':utf8' layer. It doesn't check whether the octets are actually valid UTF-8; the ':encoding(UTF-8)' layer does.
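As a minimal sketch of reading with a checking layer, the following opens a file through ':encoding(UTF-8)' so that the octets are decoded into characters on the way in. The helper name slurp_utf8 is hypothetical and not part of the original answer; by default this layer warns about malformed input rather than passing it through silently.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file and decode it as UTF-8 text.
sub slurp_utf8 {
    my ($path) = @_;

    # ':encoding(UTF-8)' decodes and checks the octets, unlike ':utf8',
    # which only marks the data as UTF-8 without validating it.
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";

    local $/;            # slurp mode: read the whole file in one go
    my $text = <$fh>;
    close $fh;

    return $text;        # a string of characters, not octets
}
```

The returned string holds characters, so length counts code points rather than bytes.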
No matter what you think your output handle is, set its encoding explicitly. There are a variety of ways to do this, such as calling binmode on the handle or using the open pragma.
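One way to do that, sketched here under the assumption that the output goes to STDOUT, is to push an encoding layer onto the handle with binmode. The open pragma is shown only as a commented alternative, because applying both would stack two encoding layers on the same handle.

```perl
use strict;
use warnings;

# Make STDOUT encode characters as UTF-8 octets on output.
binmode STDOUT, ':encoding(UTF-8)';

# Alternative (use instead of the binmode call, not together with it):
#   use open qw(:std :encoding(UTF-8));

# U+00E9 (e with acute accent) leaves the program as the octets C3 A9.
print "caf\x{E9}\n";
```

Without an explicit layer, Perl writes characters below U+0100 as single Latin-1 bytes, which is one way UTF-8 input can appear to come out as ISO-8859-1.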
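From the command line, you can use the -C switch with the S flag to make the standard handles (STDIN, STDOUT, and STDERR) UTF-8, for example perl -CS program.pl, where program.pl stands for your script.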
Tom Christiansen has a long list of things you need to pay attention to.
Does $data->{$curId}{$curField} have the utf8 flag on? If you concatenate a string that has the utf8 flag on with a string that has it off, Perl upgrades the latter to Unicode, treating each of its bytes as a Latin-1 character. This is the usual source of problems.
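To illustrate the concatenation issue, here is a small sketch that mixes a decoded string with undecoded UTF-8 octets. The variable names are made up for the example, and the core Encode module stands in for whatever produced the two strings in the real program.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "\xC3\xA9";                # two bytes: the UTF-8 encoding of U+00E9
my $chars  = decode('UTF-8', $octets);  # one character, U+00E9

# Concatenating decoded text with undecoded octets: Perl treats each
# octet as a Latin-1 character, so the tail becomes U+00C3 U+00A9.
my $mixed = $chars . $octets;

printf "chars: %d, mixed: %d\n", length($chars), length($mixed);
# prints "chars: 1, mixed: 3"

# Encoding the mixed string shows the double-encoded tail.
my $out = encode('UTF-8', $mixed);
print join(' ', map { sprintf('%02X', ord $_) } split //, $out), "\n";
# prints "C3 A9 C3 83 C2 A9"
```

The first two octets are the correctly encoded character; the last four are the original octets encoded a second time, which is the typical mojibake pattern. utf8::is_utf8 can be used to inspect the flag on each string while debugging.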