Parsing an XML file with XML::Twig

Posted on 2024-12-08 18:43:23


I have a large XML file containing entities in the format below. Could someone help me process it with XML::Twig?

 <root>
   <entity id="1" last_modified="2011-10-1">
     <entity_title> title</entity_title>
     <entity_description>description  </entity_description>
     <entity_x>  x </entity_x>
     <entity_y>  x </entity_y>
     <entity_childs>
       <child flag="1">
         <child_name>name</child_name>
         <child_type>type1</child_type>
         <child_x> some_text</child_x>
       </child>
       <child flag="1">
         <child_name>name1</child_name>
         <child_type>type2</child_type>
         <child_x> some_text</child_x>
       </child>
     </entity_childs>
     <entity_sibling>
       <family value="1" name="xc">fed</family>
       <family value="1" name="df">ff</family>
     </entity_sibling>
   </entity>
 </root>

I ran the code below and ran out of memory!

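use XML::Twig;
use Data::Dumper;

my $file = shift || die "usage: $0 file.xml\n";

my $twig = XML::Twig->new();

# parsefile() without handlers builds the whole tree in memory
my $config = $twig->parsefile( $file )->simplify();

print Dumper( $config );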


月下客 2024-12-15 18:43:23

XML::Twig is able to run in two modes, for small or for large documents. You say it's large, so you want the second approach listed in the documentation synopsis.

The example for processing huge documents goes like this:

  # at most one div will be loaded in memory
  my $twig = XML::Twig->new(
    twig_handlers =>
      { title   => sub { $_->set_tag( 'h2') }, # change title tags to h2
        para    => sub { $_->set_tag( 'p')  }, # change para to p
        hidden  => sub { $_->delete;       },  # remove hidden elements
        list    => \&my_list_process,          # process list elements
        div     => sub { $_[0]->flush;     },  # output and free memory
      },
    pretty_print => 'indented',                # output will be nicely formatted
    empty_tags   => 'html',                    # outputs <empty_tag />
  );
  $twig->parsefile( 'my_big.xml');             # handlers fire during parsing
  $twig->flush;                                # flush the end of the document

So I think you want to use that method, not the one you're currently using which is noted as only for small documents.
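
As a rough illustration of how that flush-based mode might map onto the entity format from the question, here is a minimal sketch. It assumes each entity is small enough to hold in memory on its own. The inline sample data and the child_count attribute are made up for the example; with a real file you would call parsefile instead of parse.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Inline sample in the question's format; use ->parsefile($file) for a real file.
my $xml = <<'XML';
<root>
  <entity id="1"><entity_childs><child flag="1"/><child flag="1"/></entity_childs></entity>
  <entity id="2"><entity_childs><child flag="1"/></entity_childs></entity>
</root>
XML

my $count = 0;    # number of entities processed

my $twig = XML::Twig->new(
    twig_handlers => {
        entity => sub {
            my ($t, $entity) = @_;
            my @children = $entity->descendants('child');
            # child_count is a made-up attribute, used only for illustration
            $entity->set_att( child_count => scalar @children );
            $count++;
            $t->flush;    # print everything parsed so far, then free it
        },
    },
);

$twig->parse($xml);
$twig->flush;             # print the end of the document (closing </root>)
```

Because flush writes each finished entity to STDOUT and releases it, memory use should stay roughly proportional to one entity rather than to the whole file.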


白日梦 2024-12-15 18:43:23

Yep, there is no magic in XML::Twig: if you write $twig->parsefile( $file )->simplify(); it will load the entire document into memory. I am afraid you will have to put some work into it to get just the bits you want and discard the rest. Look at the synopsis or the XML::Twig 101 section at the top of the docs for more information.

This is becoming a FAQ, so I have added the blurb above to the docs of the module.

In this particular case you probably want to set a handler (using the twig_handlers option) on entity, process each entity and then discard it by using flush if you are updating the file, or purge if you just want to extract data from it.

So the architecture of the code should look like this:

#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

my $file = shift;    

my $twig=XML::Twig->new( twig_handlers => { entity => \&process_entity },)
                  ->parsefile( $file);

exit;

sub process_entity
  { my( $t, $entity)= @_;

    # do what you have to do with $entity

   $t->purge;
  }
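
To make the skeleton a little more concrete, here is one way process_entity might be filled in when you only want to extract data. This is a sketch, not the only way to do it: the choice of fields (id, title, child names), the tab-separated output, and the inline sample document are assumptions made for illustration. Swap parse for parsefile to read a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use XML::Twig;

# Small inline sample in the question's format (with the tags closed properly).
# For a real file, call ->parsefile($file) instead of ->parse($xml).
my $xml = <<'XML';
<root>
  <entity id="1" last_modified="2011-10-1">
    <entity_title> title</entity_title>
    <entity_childs>
      <child flag="1"><child_name>name</child_name></child>
      <child flag="1"><child_name>name1</child_name></child>
    </entity_childs>
  </entity>
  <entity id="2" last_modified="2011-10-2">
    <entity_title>other</entity_title>
    <entity_childs/>
  </entity>
</root>
XML

my @rows;    # one "id<TAB>title<TAB>child names" line per entity

XML::Twig->new(
    twig_handlers => { entity => \&process_entity },
)->parse($xml);

print "$_\n" for @rows;

sub process_entity {
    my ($t, $entity) = @_;

    my $id    = $entity->att('id');
    my $title = $entity->first_child_trimmed_text('entity_title');
    my @names = map { $_->trimmed_text } $entity->descendants('child_name');

    push @rows, join("\t", $id, $title, join(',', @names));

    $t->purge;    # drop everything parsed so far, nothing is printed
}
```

Only the small @rows summary outlives each handler call; the entity subtree itself is discarded by purge as soon as it has been read.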
