String corruption and non-printable characters using XML::Twig in Win32 Perl
This is a really weird problem. It's taken me practically all day to whittle it down to a small executable script that demonstrates the problem fully.
Problem Summary: I'm using XML::Twig to pull a data snippet from an XML file, then I'm sticking that data snippet into the middle of another piece of data, let's call it parent data. The parent data has this weird non-printable character at its beginning when I start. It's vendor supplied data, so I cannot control it. My problem is that after I stick the data snippet into the middle of the parent data, the final product has a new non-printable character at its beginning in addition to the one it started with originally. This new non-printable character was not in either the parent data nor in the child data snippet. I don't know where it's coming from nor how it's getting into my data.
I'm doubtful that it is an XML::Twig bug because the string corruption occurs while reading a line from a filehandle in a while loop, but I've been unsuccessful at recreating my problem when I remove the XML::Twig code in my scripts so I had to leave it in.
This is my first experience with non-printable characters in strings that I'm trying to process. Do I need to do something special instead of treating them like ordinary strings or something?
I'm using ActiveState Perl 5.10.1 and XML::Twig 3.32 (latest) and the Eclipse 3.5.1 IDE on Windows XP.
Here is a script that demonstrates the problem:
use strict;
use warnings;
use XML::Twig;

my $FALSE = 0;
my $TRUE = 1;
my $name = 'KurtsProgram';
my $task = 'MainTask';
my $hidden_char = "\xBF";
my $data = $hidden_char .
'(*********************************************
Data-File-Header-Junk
**********************************************)
PROGRAM MainProgram ()
END_PROGRAM
TASK SecondaryTask ()
END_TASK
TASK MainTask ()
MainProgram;
END_TASK
';

my $new_data = insertProgram( $name, $task, $data );

# test to see if results start out as expected
if ( $new_data =~ m/^\Q$hidden_char\E/ ) {
    print "SUCCESS\n";
}
else {
    print STDERR "ERROR: What happened?\n";
    print STDERR "ORIGINAL: \n$data\n";
    print STDERR "MODIFIED: \n$new_data\n";
}

sub insertProgram {
    my ( $local_name, $local_task, $local_data ) = @_;

    # get program section from XML template
    my $twig = XML::Twig->new;
    $twig->parse( '<?xml version="1.0"?>
<TemplateSet>
<PROGRAM>PROGRAM <Name>ProgramNameGoesHere</Name> ()
END_PROGRAM</PROGRAM>
<TASK>TASK <Name>TaskNameGoesHere</Name> ()
END_TASK</TASK>
</TemplateSet>
' );
    my $program = $twig->root->first_child('PROGRAM');

    # replace program name in XML template
    $program->first_child('Name')->set_text($local_name);
    my $insert = $program->text();

    # stick modified program into data
    if ( $local_data =~ s/(\s+PROGRAM\s+[^\s]+\s+\()/\n\n $insert $1/ ) {
        # found it and inserted new program
    }
    else {
        # not found
        return;
    }

    # add program name to task list
    my $added_program_to_task = $FALSE;
    my $found_start = $FALSE;
    my $found_end = $FALSE;
    my $new_data = "";

    # open string as a filehandle for line by line processing
    my $filehandle;
    open( $filehandle, '<', \$local_data )
        or die("Can't open string as a filehandle: $!");
    while ( defined( my $line = <$filehandle> ) ) {
        # look for start of our task
        if (
            ( !$found_start ) &&
            ( $line =~ m/\s+TASK\s+\Q$local_task\E\s+\(/ )
        ) {
            # found the task!
            $found_start = $TRUE;
        }

        # look for end of our task
        if (
            ( $found_start ) && ( !$found_end ) &&
            ( $line =~ m/\s+END_TASK/ )
        ) {
            # found the end tag for the task section!
            $found_end = $TRUE;
            # add the program name to the bottom of the list
            $line = " " . $local_name . ";\n" . $line;
            $added_program_to_task = $TRUE;
        }

        # compile new data from processed line or original line
        $new_data = $new_data . $line;
    }
    close($filehandle);

    if ($added_program_to_task) {
        # success
    }
    else {
        # unable to find task
        return;
    }
    return $new_data;
}
When I run this script, I get the following output:
ERROR: What happened?
ORIGINAL:
¿(*********************************************
Data-File-Header-Junk
**********************************************)
PROGRAM MainProgram ()
END_PROGRAM
TASK SecondaryTask ()
END_TASK
TASK MainTask ()
MainProgram;
END_TASK
MODIFIED:
Â¿(*********************************************
Data-File-Header-Junk
**********************************************)
PROGRAM KurtsProgram ()
END_PROGRAM
PROGRAM MainProgram ()
END_PROGRAM
TASK SecondaryTask ()
END_TASK
TASK MainTask ()
MainProgram;
KurtsProgram;
END_TASK
You can see the extra character that was added to the front of the data right under the M in MODIFIED.
It has done an ISO-8859-1 to UTF-8 encoding conversion on the character: \xBF -> \xC2\xBF. XML::Twig converts all its input to UTF-8.
You could tell Twig to keep the input encoding using the keep_encoding option (also see the XML::Twig FAQ: My XML documents/data are produced by tools that do not grok Unicode, will XML::Twig help me there?). But perhaps it would be better to keep the UTF-8, or perhaps silently drop the character, depending on what exactly you're going to do with it.
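To make the conversion described above concrete, here is a minimal sketch that uses only the core Encode module (no XML::Twig involved). It encodes the single character \xBF as UTF-8 and prints the resulting bytes in hex. The helper name hex_bytes is made up for this illustration and is not part of any library.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper for this sketch: render each byte of a string as two hex digits.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# The lone ISO-8859-1 character from the question.
my $hidden_char = "\xBF";

# Encoding it as UTF-8 turns the one character into two bytes.
my $as_utf8 = encode( 'UTF-8', $hidden_char );

print hex_bytes($hidden_char), "\n";    # BF
print hex_bytes($as_utf8),     "\n";    # C2 BF
```

Viewed as ISO-8859-1, the byte \xC2 displays as "Â", which is why the extra character shows up in front of "¿". If you go the keep_encoding route, the option is passed to the constructor, along the lines of XML::Twig->new( keep_encoding => 1 ); whether that is enough for the script in the question is something you would have to check on your own setup.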
I can't really make sense of your code; it is still too complex to be debugged quickly. But maybe the problem has to do with a BOM (see the Unicode BOM FAQ), which would be ignored at the beginning of an XML document but not if you copy it into the middle of another one? I'm just guessing here because the \xBF value is part of the BOM for a UTF-8 document.
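For reference, a small sketch of why \xBF brings the BOM to mind: the byte order mark is the code point U+FEFF, and its UTF-8 encoding ends in that byte. This uses only the core Encode module; hex_bytes is again a made-up helper for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper for this sketch: render each byte of a string as two hex digits.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# U+FEFF is the byte order mark; encode it as UTF-8 to see its bytes.
my $bom_utf8 = encode( 'UTF-8', "\x{FEFF}" );

print hex_bytes($bom_utf8), "\n";    # EF BB BF

# The final byte of the UTF-8 BOM is the same value as the question's hidden character.
print "last byte matches\n" if substr( $bom_utf8, -1 ) eq "\xBF";
```

This only shows that the byte values overlap. The data in the question starts with a lone \xBF rather than the full EF BB BF sequence, so it does not by itself confirm that a BOM is involved.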