Parsing an XML file line by line

Posted on 2024-12-09 17:30:02, 2098 characters, 5 views, 0 comments

So here is the issue. I am trying to parse an XML file of information from GenBank. The file contains information on multiple DNA sequences. I have already done this for two other XML formats from GenBank (TINY xml and INSD xml), but pure XML gives me a headache. Here's how my program should work: download an XML-formatted file that contains information on X sequences from GenBank, then run my Perl script, which searches through that XML file line by line and prints the information I want to a new file in FASTA format. That format looks like this: >Sequence_name_and_information\n sequences\n >sequence_name.... and so on until all the sequences from the XML file have been written. My issue is that in pure XML the sequence itself comes before the identifier for the gene or locus of the sequence, and the gene or locus should go on the same line as the ">". Here is the code I have, from the point of opening the file and parsing through it:

open( New_File, "+>$PWD_file/$new_file" ) or die "\n\nCouldn't create file. Check permissions on location.\n\n";

    while ( my $lines = <INSD> ) {
        foreach ($lines) {
            if (m/<INSDSeq_locus>.*<\/INSDSeq_locus>/) {
                $lines =~ s/<INSDSeq_locus>//g and $lines =~ s/<\/INSDSeq_locus>//g and $lines =~ s/[a-z, |]//g; #this last bit may cause a bug of removing the letters in the genbank accession number
                $lines =~ s/ //g;
                chomp($lines);
                print New_File ">$lines\_";
            } elsif (m/<INSDSeq_organism>.*<\/INSDSeq_organism>/) {
                $lines =~ s/<INSDSeq_organism>//g and $lines =~ s/<\/INSDSeq_organism>//g;
                $lines =~ s/(\.|\?|\-| )/_/g;
                $lines =~ s/_{2,}/_/g;
                $lines =~ s/_{1,}$//;
                $lines =~ s/^>*_{1,}//; 
                $lines =~ s/\s{2}//g;
                chomp($lines);
                print New_File "$lines\n";
            } elsif (m/<INSDSeq_sequence>.*<\/INSDSeq_sequence>/) {
                $lines =~ s/<INSDSeq_sequence>//g and $lines =~ s/<\/INSDSeq_sequence>//g;
                $lines =~ s/ //g;
                chomp($lines);
                print New_File "$lines\n";
            }
        }
    }
    close INSD;
    close New_File;
}

There are two places to find gene/locus information: between either one of these two tags, LOCUS_NAME or GENE_NAME. There will be one or the other; if one has info, the other will be empty. In either case, the value needs to be added to the end of the >....... line.

Thanks,

AlphaA

PS: I tried writing the sequence to a temporary "file" (open "$NA", ">"), then moving on with the program, finding the gene info, printing it to the > line, and then reading the $NA file back and printing it to the line right after the > line. I hope this is clear.
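The buffering idea described in the PS does not strictly need a temporary file. Below is a minimal sketch, using only core Perl, that holds each record's fields in a hash and writes the FASTA entry once the record's closing tag is seen, so the order of the tags inside a record stops mattering. The sample input, the closing `</INSDSeq>` tag as the record boundary, the placement of `GENE_NAME`/`LOCUS_NAME` inside the record, and the helper name `records_to_fasta` are all assumptions made for illustration, not taken from a real GenBank file.

```perl
use strict;
use warnings;

# Hypothetical sample input; the tag layout is assumed, not real GenBank output.
my @xml_lines = (
    "<INSDSeq>\n",
    "  <INSDSeq_locus>AB123456</INSDSeq_locus>\n",
    "  <INSDSeq_organism>Homo sapiens</INSDSeq_organism>\n",
    "  <INSDSeq_sequence>acgtacgt</INSDSeq_sequence>\n",
    "  <GENE_NAME>cox1</GENE_NAME>\n",
    "  <LOCUS_NAME></LOCUS_NAME>\n",
    "</INSDSeq>\n",
);

# Collect the fields of one record in a hash and emit FASTA only when
# the record closes, regardless of the order the tags arrived in.
sub records_to_fasta {
    my @lines = @_;
    my $fasta = '';
    my %rec;
    for my $line (@lines) {
        if ($line =~ m{<(INSDSeq_locus|INSDSeq_organism|INSDSeq_sequence|GENE_NAME|LOCUS_NAME)>(.*?)</\1>}) {
            $rec{$1} = $2;    # remember the value; print nothing yet
        }
        elsif ($line =~ m{</INSDSeq>}) {
            my $org = $rec{INSDSeq_organism} // '';
            $org =~ s/[.?\-\s]+/_/g;     # dots, question marks, dashes, spaces -> "_"
            $org =~ s/^_+|_+$//g;        # trim leading/trailing underscores
            # Only one of the two tags is expected to carry a value.
            my $gene = $rec{GENE_NAME} || $rec{LOCUS_NAME} || '';
            my $header = join '_',
                grep { length($_) > 0 } ($rec{INSDSeq_locus} // '', $org, $gene);
            $fasta .= ">$header\n" . ($rec{INSDSeq_sequence} // '') . "\n";
            %rec = ();                   # start fresh for the next record
        }
    }
    return $fasta;
}

print records_to_fasta(@xml_lines);
```

In the real script the loop would read from the INSD filehandle instead of an array, and the returned text would be printed to New_File. This sketch still assumes each element sits on a single line, which is the same limitation the regex approach in the question already has.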

夏天碎花小短裙 2024-12-16 17:30:02

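It seems to me that you should use XSLT with XPath to navigate to the data you need.

As @Brian suggested, it would be easier to use established XML parsing techniques and libraries.

There is even a Perl library for XSLT.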

烂柯人 2024-12-16 17:30:02

Use an XML parser. I'm not a biologist, and I'm not sure of the final format you want, but it should be simple with this as a starting point. In the anonymous sub, $_[1] contains a hash reference that, as far as I can tell from the above, holds everything you want to keep from parsing the parent tag of the tags you are interested in. It should be easy to print the elements of $_[1] in whatever format you need:

use strict;
use warnings;

use XML::Rules;
use Data::Dumper;

my @rules = (
  _default => '',
  'INSDSeq_locus,INSDSeq_organism,INSDSeq_sequence' => 'content',
  INSDSeq  => sub { delete $_[1]{_content}; print Dumper $_[1]; return },
);

my $p = XML::Rules->new(rules => \@rules);
$p->parsefile('sequence.gbc.xml');

That version is just there to make printing only the tags you want easy. If you want some other tags as well, what I would really do is this (you don't need the @tags variable at all if you're printing element by element):

my @tags = qw(
  INSDSeq_locus
  INSDSeq_organism
  INSDSeq_sequence
);

my @rules = (
  _default => 'content',
  # Elements are, e.g. $_[1]{INSDSeq_locus}
  INSDSeq  => sub { print "$_: $_[1]{$_}\n" for @tags; return; },
);

with:

my $p = XML::Rules->new(rules => \@rules, stripspaces => 4);
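To show how the hash in $_[1] might be turned into FASTA, here is a hedged sketch built on the same rules as above. The sample document, the position of `GENE_NAME` and `LOCUS_NAME` as direct children of `INSDSeq`, and the `@fasta` array are assumptions for illustration only; the real file may nest those tags differently, in which case they would need their own rules.

```perl
use strict;
use warnings;
use XML::Rules;

# Hypothetical sample document; element placement is assumed for illustration.
my $xml = <<'XML';
<INSDSet>
  <INSDSeq>
    <INSDSeq_locus>AB123456</INSDSeq_locus>
    <INSDSeq_organism>Homo sapiens</INSDSeq_organism>
    <INSDSeq_sequence>acgtacgt</INSDSeq_sequence>
    <GENE_NAME>cox1</GENE_NAME>
    <LOCUS_NAME></LOCUS_NAME>
  </INSDSeq>
</INSDSet>
XML

my @fasta;    # one formatted FASTA entry per INSDSeq record

my @rules = (
    _default => 'content',
    INSDSeq  => sub {
        my $rec = $_[1];    # child tag name => text content
        # Replace dots, question marks, dashes and whitespace with "_".
        (my $org = $rec->{INSDSeq_organism} // '') =~ s/[.?\-\s]+/_/g;
        # Only one of the two tags is expected to carry a value.
        my $gene = $rec->{GENE_NAME} || $rec->{LOCUS_NAME} || '';
        my $header = join '_',
            grep { length($_) > 0 } ($rec->{INSDSeq_locus} // '', $org, $gene);
        push @fasta, ">$header\n" . ($rec->{INSDSeq_sequence} // '') . "\n";
        return;
    },
);

my $p = XML::Rules->new(rules => \@rules, stripspaces => 4);
$p->parse($xml);
print @fasta;
```

Because the handler runs only when the whole INSDSeq element has been parsed, it does not matter that the sequence appears before the gene or locus name in the file.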
