String corruption and non-printable characters using XML::Twig in Win32 Perl

Posted on 2024-08-10 09:36:24

This is a really weird problem. It's taken me practically all day to whittle it down to a small executable script that demonstrates the problem fully.

Problem summary: I'm using XML::Twig to pull a data snippet from an XML file and then insert that snippet into the middle of another piece of data, which I'll call the parent data. The parent data starts with a weird non-printable character. It's vendor-supplied data, so I cannot control it. My problem is that after I insert the snippet into the middle of the parent data, the final product has a new non-printable character at its beginning, in addition to the one it started with. This new character was in neither the parent data nor the child snippet. I don't know where it comes from or how it gets into my data.

I doubt that it is an XML::Twig bug, because the string corruption occurs while reading a line from a filehandle in a while loop. However, I have been unable to reproduce the problem after removing the XML::Twig code from my script, so I had to leave it in.

This is my first experience processing strings that contain non-printable characters. Do I need to do something special instead of treating them like ordinary strings?

I'm using ActiveState Perl 5.10.1 and XML::Twig 3.32 (latest) and the Eclipse 3.5.1 IDE on Windows XP.

Here is a script that demonstrates the problem:

use strict; 
use warnings; 
use XML::Twig; 

my $FALSE = 0;
my $TRUE = 1;
my $name = 'KurtsProgram';
my $task = 'MainTask';
my $hidden_char = "\xBF";
my $data = $hidden_char . 
'(*********************************************
  Data-File-Header-Junk
**********************************************)

    PROGRAM MainProgram ()
    END_PROGRAM

    TASK SecondaryTask ()
    END_TASK

    TASK MainTask ()
        MainProgram;
    END_TASK
';
my $new_data = insertProgram( $name, $task, $data );

# test to see if results start out as expected
if ( $new_data =~ m/^\Q$hidden_char\E/ ) {
    print "SUCCESS\n";
}
else {
    print STDERR "ERROR: What happened?\n";
    print STDERR "ORIGINAL: \n$data\n";
    print STDERR "MODIFIED: \n$new_data\n";
}

sub insertProgram {
    my ( $local_name, $local_task, $local_data ) = @_;

    # get program section from XML template
    my $twig = new XML::Twig;
    $twig->parse( '<?xml version="1.0"?>
<TemplateSet>
    <PROGRAM>PROGRAM <Name>ProgramNameGoesHere</Name> ()
    END_PROGRAM</PROGRAM>
    <TASK>TASK <Name>TaskNameGoesHere</Name> ()
    END_TASK</TASK>
</TemplateSet>
' );   
    my $program = $twig->root->first_child('PROGRAM');

    # replace program name in XML template
    $program->first_child('Name')->set_text($local_name);
    my $insert = $program->text();

    # stick modified program into data
    if ( $local_data =~ s/(\s+PROGRAM\s+[^\s]+\s+\()/\n\n    $insert $1/ ) {
        # found it and inserted new program
    }
    else {
        # not found
        return;
    }

    # add program name to task list
    my $added_program_to_task = $FALSE;
    my $found_start = $FALSE;
    my $found_end = $FALSE;
    my $new_data = "";
    # open string as a filehandle for line by line processing
    my $filehandle;
    open( $filehandle, '<', \$local_data )
        or die("Can't open string as a filehandle: $!");
    while (defined (my $line = <$filehandle>)) {
        # look for start of our task
        if ( 
               ( !$found_start ) &&
               ( $line =~ m/\s+TASK\s+\Q$local_task\E\s+\(/ )
            ) {
            # found the task!
            $found_start = $TRUE;
        }

        # look for end of our task
        if (
                ( $found_start ) && ( !$found_end ) &&
                ( $line =~ m/\s+END_TASK/ )
            )
        {
            # found the end tag for the task section!
            $found_end = $TRUE;

            # add the program name to the bottom of the list
            $line = "        " . $local_name . ";\n" . $line;
            $added_program_to_task = $TRUE;
        }

        # compile new data from processed line or original line
        $new_data = $new_data . $line;
    }
    close($filehandle);

    if ($added_program_to_task) {
        # success
    }
    else {
        # unable to find task
        return;
    }

    return $new_data;
}

When I run this script, I get the following output:

ERROR: What happened?
ORIGINAL: 
¿(*********************************************
      Data-File-Header-Junk
    **********************************************)

        PROGRAM MainProgram ()
        END_PROGRAM

        TASK SecondaryTask ()
        END_TASK

        TASK MainTask ()
            MainProgram;
        END_TASK

MODIFIED: 
¿(*********************************************
      Data-File-Header-Junk
    **********************************************)

        PROGRAM KurtsProgram ()
        END_PROGRAM 

        PROGRAM MainProgram ()
        END_PROGRAM

        TASK SecondaryTask ()
        END_TASK

        TASK MainTask ()
            MainProgram;
            KurtsProgram;
        END_TASK

You can see the extra character that was added to the front of the data right under the M in MODIFIED.

Comments (2)

优雅的叶子 2024-08-17 09:36:24

It has done an ISO-8859-1 to UTF-8 encoding conversion on the character: \xBF -> \xC2\xBF.
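The byte-level effect of that conversion can be reproduced with Perl's core Encode module alone. This is only a minimal sketch of the encoding step, not the code path XML::Twig itself takes; `hex_bytes` is a hypothetical helper introduced here for display.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $latin1 = "\xBF";                      # one character, U+00BF
my $utf8   = encode( 'UTF-8', $latin1 );  # the same character as UTF-8 bytes

print 'before: ', hex_bytes($latin1), "\n";   # BF
print 'after:  ', hex_bytes($utf8),   "\n";   # C2 BF
```

The single byte becomes two, which matches the extra leading character reported in the question.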

XML::Twig converts all its input to UTF-8 (see here).

You could tell Twig to keep the input encoding using the keep_encoding option (also see the XML::Twig FAQ: My XML documents/data are produced by tools that do not grok Unicode, will XML::Twig help me there?).

But perhaps it would be better to keep the UTF-8, or perhaps silently drop the character, depending on what exactly you're going to do with it.
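The last two options might look roughly like the following, using only core modules. The `$modified` string is a shortened, hypothetical stand-in for the script's output, and the choice of ISO-8859-1 as the target encoding is an assumption based on the conversion described above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical stand-in for the corrupted result: the leading \xBF now
# appears as the two UTF-8 bytes \xC2\xBF.
my $modified = "\xC2\xBF(* Data-File-Header-Junk *)\n";

# Option 1: silently drop the leading marker, in either byte form.
( my $stripped = $modified ) =~ s/\A(?:\xC2\xBF|\xBF)//;

# Option 2: turn the UTF-8 bytes back into single-byte ISO-8859-1,
# so the data again starts with the lone \xBF the vendor supplied.
my $restored = encode( 'ISO-8859-1', decode( 'UTF-8', $modified ) );

printf "stripped starts with: %s\n",   substr( $stripped, 0, 1 );
printf "restored first byte:  %02X\n", ord( substr( $restored, 0, 1 ) );
```

Which option is appropriate depends on whether the consumer of the file expects that leading byte to be present.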

静若繁花 2024-08-17 09:36:24

I can't really make sense of your code; it is still too complex to debug quickly. But maybe the problem has to do with a BOM (see the Unicode BOM FAQ), which would be ignored at the beginning of an XML document but not if you copy it into the middle of another one? I'm just guessing here because of the \xBF value, which is part of the BOM for a UTF-8 document.
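One way to test that guess is to compare the start of the data against the full UTF-8 BOM, which is the three bytes EF BB BF. The sketch below uses a hypothetical helper, `has_utf8_bom`, and shortened sample strings; it only shows that a lone \xBF is the last byte of a BOM rather than a complete one.

```perl
use strict;
use warnings;

my $UTF8_BOM = "\xEF\xBB\xBF";

# Hypothetical helper: true if the byte string starts with a full UTF-8 BOM.
sub has_utf8_bom {
    my ($bytes) = @_;
    return substr( $bytes, 0, 3 ) eq $UTF8_BOM ? 1 : 0;
}

my $with_bom  = $UTF8_BOM . "(* header *)\n";
my $lone_byte = "\xBF(* header *)\n";    # how the question's script builds its data

print has_utf8_bom($with_bom)  ? "full BOM\n" : "no full BOM\n";
print has_utf8_bom($lone_byte) ? "full BOM\n" : "no full BOM\n";

# Stripping a complete BOM from a byte string, if one is present.
( my $clean = $with_bom ) =~ s/\A\xEF\xBB\xBF//;
```

If the vendor file really begins with EF BB BF, stripping the BOM before processing may be simpler than carrying it through the substitution.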
