In Perl, how can I read parts of lines that match a criterion?

Posted 2024-07-27 04:39:48

Sample data:

603       Some garbage data not related to me, 55, 113 ->

1-ENST0000        This is sample data blh blah blah blahhhh
2-ENSBTAP0        This is also some other sample data
21-ENADT)$        DO NOT WANT TO READ THIS LINE. 
3-ENSGALP0        This is third sample data
node #4           This is 4th sample data
node #5           This is 5th sample data

This is also part of the input file but i dont wish to read this. 
Branch -> 05 13, 
      44, 1,1,4,1

17, 1150

637                   YYYYYY: 2 : %

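EDIT: In the data above, the column width is fixed within these sections, but there may be some sections I do not wish to read. The sample data above has been edited to reflect that.

So, from this input file, I want to read the contents of the first section, '1-ENST0000', into an array, the contents of '2-ENSBTAP0' into a separate array, and so on.

I am having trouble coming up with a regex that defines the pattern. The first three lines have <someNumber>-ENS<someOtherStuff>, and then there can also be node #<some number here>.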

暮倦 2024-08-03 04:39:48

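Is this really a fixed-column file? If so, don't bother with a regex. Just split each line on the column widths, and perhaps trim the trailing whitespace from column 1.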
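
A minimal sketch of that idea, using Perl's built-in unpack. The width of 18 characters is an assumption read off the sample data, and split_fixed is a hypothetical helper name, not something from the question. The sample lines are built with sprintf so that the column width in the sketch is exact.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: the first column is 18 characters wide, as the sample suggests.
my $LABEL_WIDTH = 18;

# Hypothetical helper: split one fixed-width line into (label, text).
sub split_fixed {
    my ($line) = @_;

    # "A18" takes 18 characters and strips trailing spaces; "A*" takes the rest.
    my ( $label, $text ) = unpack "A${LABEL_WIDTH} A*", $line;
    return ( $label, $text );
}

# Build sample lines with sprintf so the first column is padded to the width.
my @lines = map { sprintf "%-${LABEL_WIDTH}s%s", @$_ } (
    [ '1-ENST0000', 'This is sample data blh blah blah blahhhh' ],
    [ '21-ENADT)$', 'DO NOT WANT TO READ THIS LINE.' ],
    [ 'node #4',    'This is 4th sample data' ],
);

for my $line (@lines) {
    my ( $label, $text ) = split_fixed($line);
    print "[$label] [$text]\n";
}
```

Note that unpack only separates the columns. Deciding which lines to keep still needs a check on the label, for example a match against the <number>-ENS and node #<number> forms.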


海夕 2024-08-03 04:39:48

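OK, based on your later comment, this is a little different from the previous question. Also, I now realize that node #54 is a valid entry in the first column.

Update: I now also realize you do not need the first column.

Update: In general, you neither want nor need to deal with character arrays in Perl.

Update: Now that you have clarified what should and should not be skipped, here is a version that deals with that. Add patterns to taste in the if condition.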

#!/usr/bin/perl

use strict;
use warnings;

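my @data;    # one array reference per wanted line

while ( <DATA> ) {
    chomp;

    # Keep lines whose first column is <number>-ENS<5 chars> or node #<number>;
    # $1 captures the text that follows the first column.
    if ( /^[0-9]+-ENS.{5} +(.+)$/
            or /^node #[0-9]+ +(.+)$/
    ) {
        # split // breaks the captured text into single characters.
        push @data, [ split //, $1 ];
    }
}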

use Data::Dumper;
print Dumper \@data;

__DATA__
603       Some garbage data not related to me, 55, 113 ->

1-ENST0000        This is sample data blh blah blah blahhhh
2-ENSBTAP0        This is also some other sample data
21-ENADT)$        DO NOT WANT TO READ THIS LINE. 
3-ENSGALP0        This is third sample data
node #4           This is 4th sample data
node #5           This is 5th sample data

This is also part of the input file but i dont wish to read this. 
Branch -> 05 13, 
      44, 1,1,4,1

17, 1150

637                   YYYYYY: 2 : %
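
If the label should stay associated with its text, one possible variation is sketched below. It is not part of the answer above: collect_sections and %text_for are hypothetical names, the two patterns are merged into a single alternation, and the named captures assume Perl 5.10 or later. The text is kept as a plain string, in line with the remark about character arrays.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One pattern covering both kinds of wanted label (named captures need Perl 5.10+).
my $WANTED = qr/^(?<label>[0-9]+-ENS.{5}|node #[0-9]+) +(?<text>.+)$/;

# Hypothetical helper: return a hash ref mapping each wanted label to its text.
sub collect_sections {
    my @lines = @_;
    my %text_for;
    for my $line (@lines) {
        next unless $line =~ $WANTED;
        $text_for{ $+{label} } = $+{text};
    }
    return \%text_for;
}

# A shortened copy of the sample data, including two lines that must be skipped.
my @lines = split /\n/, <<'END';
603       Some garbage data not related to me, 55, 113 ->
1-ENST0000        This is sample data blh blah blah blahhhh
21-ENADT)$        DO NOT WANT TO READ THIS LINE.
node #4           This is 4th sample data
END

my $text_for = collect_sections(@lines);

for my $label ( sort keys %$text_for ) {
    # substr and other string functions work on the text directly,
    # so there is no need to split it into characters first.
    printf "%-12s starts with '%s'\n", $label, substr( $text_for->{$label}, 0, 7 );
}
```

A hash keeps only the last line seen for a given label. If labels can repeat, storing an array reference per label would be the safer choice.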

As for learning how to fish, I recommend you read everything relevant listed in perldoc perltoc.
