“内存不足”使用 perl 解析大型 (100 Mb) XML 文件时

发布于 2024-12-02 16:30:56 字数 904 浏览 1 评论 0原文

我在解析大型 (100 Mb) XML 文件时出现错误“内存不足”，

use strict;
use warnings;
use XML::Twig;

my $twig=XML::Twig->new();
my $data = XML::Twig->new
             ->parsefile("divisionhouserooms-v3.xml")
               ->simplify( keyattr => []);

my @good_division_numbers = qw( 30 31 32 35 38 );

foreach my $property ( @{ $data->{DivisionHouseRoom}}) {

    my $house_code = $property->{HouseCode};
    print $house_code, "\n";

    my $amount_of_bedrooms = 0;

    foreach my $division ( @{ $property->{Divisions}->{Division} } ) {

        next unless grep { $_ eq $division->{DivisionNumber} } @good_division_numbers;
        $amount_of_bedrooms += $division->{DivisionQuantity};
    }

    open my $fh, ">>", "Result.csv" or die $!;
    print $fh join("\t", $house_code, $amount_of_bedrooms), "\n";
    close $fh;
}

我该如何解决此错误问题？

原文

I have an error "Out of memory" while parsing large (100 Mb) XML file

use strict;
use warnings;
use XML::Twig;

my $twig=XML::Twig->new();
my $data = XML::Twig->new
             ->parsefile("divisionhouserooms-v3.xml")
               ->simplify( keyattr => []);

my @good_division_numbers = qw( 30 31 32 35 38 );

foreach my $property ( @{ $data->{DivisionHouseRoom}}) {

    my $house_code = $property->{HouseCode};
    print $house_code, "\n";

    my $amount_of_bedrooms = 0;

    foreach my $division ( @{ $property->{Divisions}->{Division} } ) {

        next unless grep { $_ eq $division->{DivisionNumber} } @good_division_numbers;
        $amount_of_bedrooms += $division->{DivisionQuantity};
    }

    open my $fh, ">>", "Result.csv" or die $!;
    print $fh join("\t", $house_code, $amount_of_bedrooms), "\n";
    close $fh;
}

What i can do to fix this error issue?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

久夏青 2024-12-09 16:30:56

处理不适合内存的大型 XML 文件是 XML::Twig< /a> 广告：

XML::Twig 的优点之一是它可以让您处理文件
不适合内存（顺便说一句，将 XML 文档作为
树非常消耗内存，扩展因子通常是
大约10）。
为此，您可以定义处理程序，该处理程序将在
特定元素已完全解析。在这些处理程序中，您可以
访问元素并按照您认为合适的方式处理它 (...)

问题中发布的代码根本没有利用 XML::Twig 的优势（使用 simplify 方法并不比 XML::Simple）。

代码中缺少的是“twig_handlers”或“twig_roots”，它们本质上会导致解析器高效地关注 XML 文档的相关部分。

如果没有看到 XML，很难判断是否逐块处理文档或<一个href="http://search.cpan.org/~mirod/XML-Twig-3.38/Twig_pm.slow#Processing_just_parts_of_an_XML_document">仅选择部分是可行的方法，但任何一个都应该解决这个问题。

因此，代码应如下所示（逐块演示）：

use strict;
use warnings;
use XML::Twig;
use List::Util 'sum';   # To make life easier
use Data::Dump 'dump';  # To see what's going on

my %bedrooms;           # Data structure to store the wanted info

my $xml = XML::Twig->new (
                          twig_roots => {
                                          DivisionHouseRoom => \&count_bedrooms,
                                        }
                         );

$xml->parsefile( 'divisionhouserooms-v3.xml');

sub count_bedrooms {

    my ( $twig, $element ) = @_;

    my @divParents = $element->children( 'Divisions' );
    my $id = $element->first_child_text( 'HouseCode' );

    for my $divParent ( @divParents ) {
        my @divisions = $divParent->children( 'Division' );
        my $total = sum map { $_->text } @divisions;
        $bedrooms{$id} = $total;
    }

    $element->purge;   # Free up memory
}

dump \%bedrooms;

Handling large XML files that don't fit in memory is something that XML::Twig advertises:

One of the strengths of XML::Twig is that it let you work with files
that do not fit in memory (BTW storing an XML document in memory as a
tree is quite memory-expensive, the expansion factor being often
around 10).
To do this you can define handlers, that will be called once a
specific element has been completely parsed. In these handlers you can
access the element and process it as you see fit (...)

The code posted in the question isn't making use of the strength of XML::Twig at all (using the simplify method doesn't make it much better than XML::Simple).

What's missing from the code are the 'twig_handlers' or 'twig_roots', which essentially cause the parser to focus on relevant portions of the XML document memory-efficiently.

It's difficult to say without seeing the XML whether processing the document chunk-by-chunk or just selected parts is the way to go, but either one should solve this issue.

So the code should look something like the following (chunk-by-chunk demo):

use strict;
use warnings;
use XML::Twig;
use List::Util 'sum';   # To make life easier
use Data::Dump 'dump';  # To see what's going on

my %bedrooms;           # Data structure to store the wanted info

my $xml = XML::Twig->new (
                          twig_roots => {
                                          DivisionHouseRoom => \&count_bedrooms,
                                        }
                         );

$xml->parsefile( 'divisionhouserooms-v3.xml');

sub count_bedrooms {

    my ( $twig, $element ) = @_;

    my @divParents = $element->children( 'Divisions' );
    my $id = $element->first_child_text( 'HouseCode' );

    for my $divParent ( @divParents ) {
        my @divisions = $divParent->children( 'Division' );
        my $total = sum map { $_->text } @divisions;
        $bedrooms{$id} = $total;
    }

    $element->purge;   # Free up memory
}

dump \%bedrooms;

回复收藏 0 原文