Parsing multiline data in Perl

Posted on 2024-09-30 17:31:56

I have some data that I need to analyze. The data is multiline, and the blocks are separated by a blank line, something like this:

Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example

I need to select the data blocks that have some particular property present: for example, only those that have Property 4, or only those that have both Property 3 and Property 6. I might also need to choose based on the values of these properties, for example only those blocks that have Property 3 with the value 'an'.

How would I do this in Perl? I tried splitting on "\n", but that didn't seem to work properly. Am I missing something?

Comments (8)

羅雙樹 2024-10-07 17:31:56

The secret to making this task simple is to use the $/ variable to put Perl into "paragraph mode". That makes it easy to process your records one at a time. You can then filter them with something like grep.

#!/usr/bin/perl

use strict;
use warnings;

my @data = do {
  local $/ = '';
  <DATA>;
};

my @with_4   = grep { /^Property 4:/m } @data;

my @with_3   = grep { /^Property 3:/m } @data;
my @with_3_6 = grep { /^Property 6:/m } @with_3;

print scalar @with_3_6;

__DATA__
Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example

In that example I'm processing each record as plain text. For more complex work, I'd probably turn each record into a hash.

#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

my @data;

{
  local $/ = '';

  while (<DATA>) {
    chomp;

    my @rec = split /\n/;
    my %prop;
    foreach my $r (@rec) {
      my ($k, $v) = split /:\s+/, $r;
      $prop{$k} = $v;
    }

    push @data, \%prop;
  }
}

my @with_4   = grep { exists $_->{'Property 4'} } @data;

my @with_3_6 = grep { exists $_->{'Property 3'} and
                      exists $_->{'Property 6'} } @data;

my @with_3an = grep { exists $_->{'Property 3'} and
                      $_->{'Property 3'} eq 'an' } @data;

print Dumper @with_3an;

__DATA__
Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example

所谓喜欢 2024-10-07 17:31:56

Depending on the size of each property set and how much memory you have...

I'd use a simple state machine that scans sequentially through the file - with a line-by-line sequential scan, not multiline - adding each property/id/value to a hash keyed on id. When you get a blank line or end-of-file, determine whether the elements of the hash should be filtered in or out, and emit them as necessary, then reset the hash.
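A minimal sketch of that line-by-line approach might look like the following. The helper name filter_records, the predicate-callback interface, and the in-memory filehandle are assumptions made for illustration; they are not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: read records line by line from a filehandle and
# return those for which the caller's predicate returns true.
sub filter_records {
    my ( $fh, $keep ) = @_;
    my ( @kept, %record );

    # Keep a copy of the current record if it is non-empty and wanted,
    # then reset the hash for the next record.
    my $flush = sub {
        push @kept, { %record } if %record && $keep->( \%record );
        %record = ();
    };

    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^\s*$/ ) {
            $flush->();              # a blank line ends the current record
        }
        elsif ( $line =~ /^Property\s+(\d+):\s*(.*?)\s*$/ ) {
            $record{$1} = $2;        # hash keyed on the property id
        }
    }
    $flush->();                      # end of file ends the last record
    return @kept;
}

# The sample data from the question, held in a string (assumption: real
# input would come from a file opened the usual way).
my $text = <<'END';
Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example
END

open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
my @with_3an = filter_records( $fh, sub { ( $_[0]{3} // '' ) eq 'an' } );
close $fh;

printf "%d matching record(s); Property 6 = %s\n",
    scalar @with_3an, $with_3an[0]{6};
```

Only one record is held in memory at a time while scanning, so memory use depends on how many records pass the filter rather than on the size of the file.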

盗琴音 2024-10-07 17:31:56

Quick and dirty:

my $string = <<END;
Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example
END

my @blocks = split /\n\n/, $string;

my @desired_blocks = grep /Property 1: 1234/, @blocks;

print join("\n----\n", @desired_blocks), "\n";

海夕 2024-10-07 17:31:56

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my $propertyRef;
my $propertyRefIdx = 0;

while (<>) {
    chomp($_);
    if ($_ =~ /Property (\d+): (.*)/) {
        my $propertyKey = $1;
        my $propertyValue = $2;

        $propertyRef->[$propertyRefIdx]->{$propertyKey} = $propertyValue;
    }
    else {
        $propertyRefIdx++;
    }
}

print Dumper $propertyRef;

Let's say this script is called propertyParser.pl and you have a file containing the properties and values called properties.txt. You could call this as follows:

$ propertyParser.pl < properties.txt

Once you have populated $propertyRef with all your data, you can then loop through elements and filter them based on whatever rules you need to apply, such as certain key and/or value combinations:

foreach my $property (@{$propertyRef}) {
    if (defined $property->{1} && defined $property->{3} 
                               && ! defined $property->{6}) {
        # do something for keys 1 and 3 but not 6, etc.
    }
}

笑梦风尘 2024-10-07 17:31:56

Your record separator should be "\n\n". Every line ends with a single newline, and blocks are separated by a double newline. Using this idea, it is fairly easy to pick out the blocks with Property 4.

use strict;
use warnings;
use English qw<$RS>;

open( my $inh, ... ) or die "I'm dead!";

local $RS = "\n\n";
while ( my $block = <$inh> ) { 
    if ( my ( $prop4 ) = $block =~ m/^Property 4:\s+(.*)/m ) { 
        ...
    }
    if ( my ( $prop3, $prop6 ) 
             = $block =~ m/
        ^Property \s+ 3: \s+ ([^\n]*)
        .*?
        ^Property \s+ 6: \s+ ([^\n]*)
        /smx 
       ) {
        ...
    }
}

Both expressions use the multiline flag ('m'), so that ^ matches at the start of any line. The second one also uses the 's' flag, which lets '.' match newlines, and the extended syntax flag ('x'), which, among other things, ignores whitespace within the expression.
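As a small illustration of what those flags change, here is a sketch run against one block of the sample data. The variable names and the hard-coded block are assumptions for the example only.

```perl
use strict;
use warnings;

# One block of the sample data, held in a string for illustration.
my $block = "Property 1: just\nProperty 3: an\nProperty 5: simple\nProperty 6: example\n";

# Without the m flag, ^ only matches at the very start of the string,
# so a property on the second line is not found.
my $without_m = $block =~ /^Property 3:/  ? 1 : 0;
my $with_m    = $block =~ /^Property 3:/m ? 1 : 0;

# With the s flag, '.' also matches a newline, so .*? can span lines.
# With the x flag, whitespace and comments inside the pattern are ignored.
my ( $prop3, $prop6 ) = $block =~ m/
    ^Property \s+ 3: \s+ ([^\n]*)   # value of Property 3
    .*?                             # anything in between, newlines included
    ^Property \s+ 6: \s+ ([^\n]*)   # value of Property 6
/smx;

print "without m: $without_m, with m: $with_m\n";
print "Property 3 = $prop3, Property 6 = $prop6\n";
```

Note that this pattern only matches when Property 3 appears before Property 6 within the block.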

If the data were fairly small, you could process it all in one go:

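use strict;
use warnings;
use English qw<$RS>;
use Data::Dumper;

local $RS = "\n\n";
# The unary plus makes Perl parse the inner braces as an anonymous hash
# constructor rather than a bare block.
my @block
    = map { +{ m/^Property \s+ (\d+): \s+ (.*?\S) \s+/gmx } } <DATA>
    ;
print Data::Dumper->Dump( [ \@block ], [ '*block' ] ), "\n";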

Which shows the result to be:

@block = (
           {
             '1' => '1234',
             '3' => 'ACBGD',
             '2' => '34546'
           },
           {
             '4' => '4567',
             '1' => '1234'
           },
           {
             '6' => 'example',
             '1' => 'just',
             '3' => 'an',
             '5' => 'simple'
           }
         );

表情可笑 2024-10-07 17:31:56

Check what the $/ variable (the input record separator) can do for you. You can set the "end of line" separator to whatever you please. You could try setting it to "\n\n":

$/ = "\n\n";
foreach my $property (<DATA>)
    {
    print "$property\n";
    }


__DATA__
Property 1: 1234
Property 2: 34546
Property 3: ACBGD

Property 1: 1234
Property 4: 4567

Property 1: just
Property 3: an
Property 5: simple
Property 6: example

As your data elements seem to be delimited by blank lines, this will read each group of property lines one by one.
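One caveat worth checking against your real data: "\n\n" is a literal separator, whereas setting $/ to the empty string (paragraph mode) treats any run of consecutive blank lines as a single separator. The sketch below compares the two on input with three blank lines between records. The helper read_records and the sample string are assumptions for illustration.

```perl
use strict;
use warnings;

# Sample input with three blank lines between the two records
# (an assumption, chosen to show the difference).
my $text = "Property 1: 1234\n\n\n\nProperty 4: 4567\n";

# Hypothetical helper: read every record from a string using the given
# input record separator.
sub read_records {
    my ( $string, $separator ) = @_;
    local $/ = $separator;
    open my $fh, '<', \$string or die "Cannot open in-memory handle: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

my @fixed     = read_records( $text, "\n\n" );   # literal two-newline separator
my @paragraph = read_records( $text, '' );       # paragraph mode

printf "literal separator: %d records\n", scalar @fixed;
printf "paragraph mode:    %d records\n", scalar @paragraph;
```

With the literal separator, the extra blank lines should come back as a record of their own, which a filter would then have to skip. Paragraph mode should return just the two real records.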

You could also read the entire file into an array and process it from memory:

my(@lines) = <DATA>;

残龙傲雪 2024-10-07 17:31:56

Assuming that your data is stored in a file (let's say mydata.txt), you could write the following Perl script (let's call it Bob.pl):

my @currentBlock = ();
my $displayCurrentBlock = 0;
# This will iterate on each line of the file
while (<>) {
  # We check the content of $_ (the current line)
  if ($_ =~ /^\s*$/) {
    # $_ is an empty line, so we display the current block if needed
    print @currentBlock if $displayCurrentBlock;
    # Current block and display status are reset
    @currentBlock = ();
    $displayCurrentBlock = 0;
  } else{
    # $_ is not an empty line, we add it to the current block
    push @currentBlock, $_;
    # We set the display status to true if a certain condition is met
    # (\s*$ also matches a final line that has no trailing newline)
    $displayCurrentBlock = 1 if ($_ =~ /^Property 3: an\s*$/);
  }
}
# A last check and print for the last block
print @currentBlock if $displayCurrentBlock;

Next, you just have to launch perl Bob.pl < mydata.txt, and voilà!

localhost> perl Bob.pl < mydata.txt
Property 1: just
Property 3: an
Property 5: simple
Property 6: example

乞讨 2024-10-07 17:31:56

In relation to the first part of your question, you can read records in "paragraph mode" using perl's -00 command-line option, for example:

#!/usr/bin/perl -00

my @data = <>;

# Print the last block.
print $data[-1], "\n";