解析大型 XML 文件?

发布于 2024-10-02 13:21:02 字数 4770 浏览 2 评论 0原文

我有 2 个 xml 文件,其中一个大小为 115mb,另一个大小为 34mb。

Wiile 读取文件 A 时,有 1 个名为 desc 的字段与文件 B 相关,我从文件 B 中检索字段 id,其中 desc.file A 等于 name.file B。

文件 A 已经太大了,所以我必须在内部搜索文件B需要很长时间才能完成。

我怎样才能加快这个过程或者什么是更好的方法?

我正在使用的当前代码:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Simple qw(:strict XMLin);

my $npcs = XMLin('Client/client_npcs.xml', KeyAttr => { }, ForceArray => [ 'npc_client' ]);
my $strings = XMLin('Client/client_strings.xml', KeyAttr => { }, ForceArray => [ 'string' ]);

my ($nameid,$rank);

open (my $fh, '>>', 'Output/npc_templates.xml');
print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
foreach my $npc ( @{ $npcs->{npc_client} } ) {
        if (defined $npc->{desc}) {
                foreach my $string (@{$strings->{string}}) {
                        if (defined $string->{name} && $string->{name} =~ /$npc->{desc}/i) {
                                $nameid = $string->{id};
                                last;
                        }
                }
        } else {
                $nameid = "";
        }

        if (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 25 && $npc->{hpgauge_level} < 28) {
            $rank = 'LEGENDARY';
        } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 21 && $npc->{hpgauge_level} < 23) {
            $rank = 'HERO';
        } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 10 && $npc->{hpgauge_level} < 15) {
            $rank = 'ELITE';
        } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 0 && $npc->{hpgauge_level} < 11) {
            $rank = 'NORMAL';
        } else {
            $rank = $gauge;
        }

        print $fh qq|\t<npc_template npc_id="$npc->{id}" name="$npc->{name}" name_id="$nameid" height="$npc->{scale}" rank="$rank" tribe="$npc->{tribe}" race="$npc->{race_type}" hp_gauge="$npc->{hpgauge_level}"/>\n|;
}
print $fh "</<npc_templates>";
close($fh);

文件 A.xml 的示例:

<?xml version="1.0" encoding="utf-16"?>
<npc_clients>
  <npc_client>
    <id>200000</id>
    <name>SkillZone</name>
    <desc>STR_NPC_NO_NAME</desc>
    <dir>Monster/Worm</dir>
    <mesh>Worm</mesh>
    <material>mat_mob_reptile</material>
    <show_dmg_decal>0</show_dmg_decal>
    <ui_type>general</ui_type>
    <cursor_type>none</cursor_type>
    <hide_path>0</hide_path>
    <erect>1</erect>
    <bound_radius>
      <front>1.200000</front>
      <side>3.456000</side>
      <upper>3.000000</upper>
    </bound_radius>
    <scale>10</scale>
    <weapon_scale>100</weapon_scale>
    <altitude>0.000000</altitude>
    <stare_angle>75.000000</stare_angle>
    <stare_distance>20.000000</stare_distance>
    <move_speed_normal_walk>0.000000</move_speed_normal_walk>
    <art_org_move_speed_normal_walk>0.000000</art_org_move_speed_normal_walk>
    <move_speed_normal_run>0.000000</move_speed_normal_run>
    <move_speed_combat_run>0.000000</move_speed_combat_run>
    <art_org_speed_combat_run>0.000000</art_org_speed_combat_run>
    <in_time>0.100000</in_time>
    <out_time>0.500000</out_time>
    <neck_angle>90.000000</neck_angle>
    <spine_angle>10.000000</spine_angle>
    <ammo_bone>Bip01 Head</ammo_bone>
    <ammo_fx>skill_stoneshard.stoneshard.ammo</ammo_fx>
    <ammo_speed>50</ammo_speed>
    <pushed_range>0.000000</pushed_range>
    <hpgauge_level>3</hpgauge_level>
    <magical_skill_boost>0</magical_skill_boost>
    <attack_delay>2000</attack_delay>
    <ai_name>SummonSkillArea</ai_name>
    <tribe>General</tribe>
    <pet_ai_name>Pet</pet_ai_name>
    <sensory_range>15.000000</sensory_range>
  </npc_client>
</npc_clients>

文件 B.xml 的示例:

<?xml version="1.0" encoding="utf-16"?>
<strings>
  <string>
    <id>350000</id>
    <name>STR_NPC_NO_NAME</name>
    <body> </body>
  </string>
</strings>

I have 2 xml files 1 with 115mb size and another with 34mb size.

Wiile reading file A there is 1 field called desc that relations it with file B where I retrieve the field id from file B where desc.file A is iqual to name.file B.

file A is already too big then I have to search inside file B and it takes a very long time to complete.

How could I speed up this proccess or what would be a better approch to do it ?

current code I am using:

#!/usr/bin/perl

use strict;
use warnings;

use XML::Simple qw(:strict XMLin);

my $npcs = XMLin('Client/client_npcs.xml', KeyAttr => { }, ForceArray => [ 'npc_client' ]);
my $strings = XMLin('Client/client_strings.xml', KeyAttr => { }, ForceArray => [ 'string' ]);

my ($nameid,$rank);

open (my $fh, '>>', 'Output/npc_templates.xml');
print $fh "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
foreach my $npc ( @{ $npcs->{npc_client} } ) {
        if (defined $npc->{desc}) {
                foreach my $string (@{$strings->{string}}) {
                        if (defined $string->{name} && $string->{name} =~ /$npc->{desc}/i) {
                                $nameid = $string->{id};
                                last;
                        }
                }
        } else {
                $nameid = "";
        }

        if (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 25 && $npc->{hpgauge_level} < 28) {
            $rank = 'LEGENDARY';
        } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 21 && $npc->{hpgauge_level} < 23) {
            $rank = 'HERO';
        } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 10 && $npc->{hpgauge_level} < 15) {
            $rank = 'ELITE';
        } elsif (defined $npc->{hpgauge_level} && $npc->{hpgauge_level} > 0 && $npc->{hpgauge_level} < 11) {
            $rank = 'NORMAL';
        } else {
            $rank = $gauge;
        }

        print $fh qq|\t<npc_template npc_id="$npc->{id}" name="$npc->{name}" name_id="$nameid" height="$npc->{scale}" rank="$rank" tribe="$npc->{tribe}" race="$npc->{race_type}" hp_gauge="$npc->{hpgauge_level}"/>\n|;
}
print $fh "</<npc_templates>";
close($fh);

example of file A.xml:

<?xml version="1.0" encoding="utf-16"?>
<npc_clients>
  <npc_client>
    <id>200000</id>
    <name>SkillZone</name>
    <desc>STR_NPC_NO_NAME</desc>
    <dir>Monster/Worm</dir>
    <mesh>Worm</mesh>
    <material>mat_mob_reptile</material>
    <show_dmg_decal>0</show_dmg_decal>
    <ui_type>general</ui_type>
    <cursor_type>none</cursor_type>
    <hide_path>0</hide_path>
    <erect>1</erect>
    <bound_radius>
      <front>1.200000</front>
      <side>3.456000</side>
      <upper>3.000000</upper>
    </bound_radius>
    <scale>10</scale>
    <weapon_scale>100</weapon_scale>
    <altitude>0.000000</altitude>
    <stare_angle>75.000000</stare_angle>
    <stare_distance>20.000000</stare_distance>
    <move_speed_normal_walk>0.000000</move_speed_normal_walk>
    <art_org_move_speed_normal_walk>0.000000</art_org_move_speed_normal_walk>
    <move_speed_normal_run>0.000000</move_speed_normal_run>
    <move_speed_combat_run>0.000000</move_speed_combat_run>
    <art_org_speed_combat_run>0.000000</art_org_speed_combat_run>
    <in_time>0.100000</in_time>
    <out_time>0.500000</out_time>
    <neck_angle>90.000000</neck_angle>
    <spine_angle>10.000000</spine_angle>
    <ammo_bone>Bip01 Head</ammo_bone>
    <ammo_fx>skill_stoneshard.stoneshard.ammo</ammo_fx>
    <ammo_speed>50</ammo_speed>
    <pushed_range>0.000000</pushed_range>
    <hpgauge_level>3</hpgauge_level>
    <magical_skill_boost>0</magical_skill_boost>
    <attack_delay>2000</attack_delay>
    <ai_name>SummonSkillArea</ai_name>
    <tribe>General</tribe>
    <pet_ai_name>Pet</pet_ai_name>
    <sensory_range>15.000000</sensory_range>
  </npc_client>
</npc_clients>

example of file B.xml:

<?xml version="1.0" encoding="utf-16"?>
<strings>
  <string>
    <id>350000</id>
    <name>STR_NPC_NO_NAME</name>
    <body> </body>
  </string>
</strings>

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

风蛊 2024-10-09 13:21:03

以下是 XML::Twig 用法的示例。主要优点是它不会将整个文件保存在内存中,因此处理速度要快得多。下面的代码试图模拟问题中脚本的操作。

use XML::Twig;

my %strings = ();
XML::Twig->new(
    twig_handlers => {
        'strings/string' => sub {
            $strings{ lc $_->first_child('name')->text }
                = $_->first_child('id')->text
        },
    }
)->parsefile('B.xml');

print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
XML::Twig->new(
    twig_handlers => {
        'npc_client' => sub {
            my $nameid = eval { $strings{ lc $_->first_child('desc')->text } };

            # calculate rank as needed
            my $hpgauge_level = eval { $_->first_child('hpgauge_level')->text };
            $rank = $hpgauge_level >= 28 ? 'ERROR'
                  : $hpgauge_level  > 25 ? 'LEGENDARY'
                  : $hpgauge_level  > 21 ? 'HERO'
                  : $hpgauge_level  > 10 ? 'ELITE'
                  : $hpgauge_level  >  0 ? 'NORMAL'
                  :                        $hpgauge_level;

            my $npc_id    = eval { $_->first_child('id')->text };
            my $name      = eval { $_->first_child('name')->text };
            my $tribe     = eval { $_->first_child('tribe')->text };
            my $scale     = eval { $_->first_child('scale')->text };
            my $race_type = eval { $_->first_child('race_type')->text };
            print
                qq|\t<npc_template npc_id="$npc_id" name="$name" name_id="$nameid" height="$scale" rank="$rank" tribe="$tribe" race="$race_type" hp_gauge="$hpgauge_level"/>\n|;
            $_->purge;
        }
    }
)->parsefile('A.xml');
print "</<npc_templates>";

Here is example of XML::Twig usage. The main advantage is that it is not holding whole file in memory, so processing is much faster. The code below is trying to emulate operation of script from question.

use XML::Twig;

my %strings = ();
XML::Twig->new(
    twig_handlers => {
        'strings/string' => sub {
            $strings{ lc $_->first_child('name')->text }
                = $_->first_child('id')->text
        },
    }
)->parsefile('B.xml');

print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<npc_templates xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xsi:noNamespaceSchemaLocation=\"npcs.xsd\">\n";
XML::Twig->new(
    twig_handlers => {
        'npc_client' => sub {
            my $nameid = eval { $strings{ lc $_->first_child('desc')->text } };

            # calculate rank as needed
            my $hpgauge_level = eval { $_->first_child('hpgauge_level')->text };
            $rank = $hpgauge_level >= 28 ? 'ERROR'
                  : $hpgauge_level  > 25 ? 'LEGENDARY'
                  : $hpgauge_level  > 21 ? 'HERO'
                  : $hpgauge_level  > 10 ? 'ELITE'
                  : $hpgauge_level  >  0 ? 'NORMAL'
                  :                        $hpgauge_level;

            my $npc_id    = eval { $_->first_child('id')->text };
            my $name      = eval { $_->first_child('name')->text };
            my $tribe     = eval { $_->first_child('tribe')->text };
            my $scale     = eval { $_->first_child('scale')->text };
            my $race_type = eval { $_->first_child('race_type')->text };
            print
                qq|\t<npc_template npc_id="$npc_id" name="$name" name_id="$nameid" height="$scale" rank="$rank" tribe="$tribe" race="$race_type" hp_gauge="$hpgauge_level"/>\n|;
            $_->purge;
        }
    }
)->parsefile('A.xml');
print "</<npc_templates>";
傻比既视感 2024-10-09 13:21:03
  1. 从文件 A 中获取所有有趣的“desc”字段并将它们放入哈希中。您只需解析一次,但如果仍然需要很长时间,请查看 XML: :树枝
  2. 解析文件 B. 一次并提取您需要的内容。使用哈希。

看起来您只需要 xml 文件的一部分。 XML::Twig 可以仅解析您感兴趣的元素,并使用“twig_roots”参数丢弃其余元素。 XML::Simple 更容易上手。

  1. Grab all the interesting 'desc' fields from file A and put them in a hash. You only have to parse it once, but if it still takes too long have a look at XML::Twig.
  2. Parse file B. once and extract the stuff you need. Use the hash.

Looks like you only need parts of the xml files. XML::Twig can parse only the elements you are interested in and throw away the rest using the "twig_roots" parameter. XML::Simple is easier to get started with though..

三寸金莲 2024-10-09 13:21:03

虽然我无法帮助您了解 Perl 代码的细节,但在处理大量 XML 数据时有一些通用准则。概括地说,有 2 种 XML API:基于 DOM 的和基于 Stream 的。基于 Dom 的 API(如 XML DOM)将在用户级 API 变得“可用”之前将整个 XML 文档解析到内存中,而使用基于流的 API(如 SAX),实现不需要解析整个 XML 文档。基于流的解析器的一个好处是它们通常使用更少的内存,因为它们不需要立即将整个 XML 文档保存在内存中 - 这在处理大型 XML 文档时显然是一件好事。看看这里的 XML::Simple 文档,似乎有 可能提供 SAX 支持 - 您尝试过吗?

Although I can't help you with the specifics of your Perl code, there are some general guidelines when dealing with large volumes of XML data. There are, broadly speaking, 2 kinds of XML APIs - DOM based and Stream based. Dom based API's (like XML DOM) will parse an entire XML document in to memory before the user-level API becomes "available", whereas with a stream based API (like SAX) the implementation doesn't need to parse the whole XML document. One benefit of Stream based parsers are that they typically use much less memory, as they don't need to hold the entire XML document in memory at once - this is obviously a good thing when dealing with large XML documents. Looking at the XML::Simple docs here, it's seems there may be SAX support available - have you tried this?

梦幻之岛 2024-10-09 13:21:03

我不是 Perl 人员,所以对此持保留态度,但我看到两个问题:

  1. 事实上,您正在迭代文件 B 中的所有值,直到找到文件 B 中每个元素的正确值。文件 A 效率低下。相反,您应该对文件 B 中的值使用某种映射/字典。

  2. 看起来您在开始处理之前就正在解析内存中的这两个文件。文件 A 最好作为流进行处理,而不是将整个文档加载到内存中。

I'm not a perl guy, so take this with a grain of salt, but I see 2 problems:

  1. The fact that you are iterating over all of the values in file B until you find the correct value for each element in file A is inefficient. Instead, you should be using some sort of map/dictionary for the values in file B.

  2. It looks like you are parsing the both files in memory before you even start processing. File A would be better processed as a stream instead of loading the entire document into memory.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文