在 Perl 中解析垂直分隔的文件

发布于 2024-12-16 00:16:42 字数 871 浏览 0 评论 0原文

我有一个如下所示的文件:

*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62

*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

每个记录块都有不同的行数(例如CX条目并不总是存在)。 但如果 CX 存在,则仅显示为 1 个条目。 我们想要得到一个以“MH”为键、“CX”为值的哈希。

因此,解析上述数据我们希望得到这样的结构:

$VAR = {  "Urinary Bladder" => ["CYST-" , "VESIC-"]};

解析它的正确方法是什么?

我坚持这个,这并没有给我想要的结果。

use Data::Dumper;
my %bighash;
my $key = "";
my $cx = "";
while (<>) {

   chomp;

   if (/^MH = (\w+/)) {

      $key = $1;     
      push @{$bighash{$key}}, " ";
   }
   elsif ( /^CX = (\w+/)) {
      $cx = $1;

   }
   else {
      push @{$bighash{$key}}, $cx;

   }

} 

I have a file that looks like this:

*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62

*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

Each record chunk has different number of lines, (e.g. CX entry does not always present).
But if CX exists, in only appear as 1 entry only.
We want to get a Hash that takes "MH" as keys and "CX" as values.

Hence parsing the above data we hope to get this structure:

$VAR = {  "Urinary Bladder" => ["CYST-" , "VESIC-"]};

What's the right way to parse it?

I'm stuck with this, that doesn't give me result as I want.

use Data::Dumper;
my %bighash;
my $key = "";
my $cx = "";
while (<>) {

   chomp;

   if (/^MH = (\w+/)) {

      $key = $1;     
      push @{$bighash{$key}}, " ";
   }
   elsif ( /^CX = (\w+/)) {
      $cx = $1;

   }
   else {
      push @{$bighash{$key}}, $cx;

   }

} 

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(6

信愁 2024-12-23 00:16:42

如果您使用 $/ 一次读取一段数据,这会变得更简单。我很惊讶没有其他人提出这一建议。

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

use Data::Dumper;

my %bighash;

$/ = '';

while (<DATA>) {
  if (my ($k) = /^MH = (.*?)$/m and my ($v) = /^CX = (.*?)$/m) {
    $bighash{$k} = [ $v =~ /([A-Z]+-)/g ];
  }
}

say Dumper \%bighash;

__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62

*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

输出如下所示:

$VAR1 = {
          'Urinary Bladder' => [
                                 'CYST-',
                                 'VESIC-'
                               ]
        };

This becomes simpler if you use $/ to read the data a paragraph at a time. I'm surprised that no-one else has suggested that.

#!/usr/bin/perl

use strict;
use warnings;
use 5.010;

use Data::Dumper;

my %bighash;

$/ = '';

while (<DATA>) {
  if (my ($k) = /^MH = (.*?)$/m and my ($v) = /^CX = (.*?)$/m) {
    $bighash{$k} = [ $v =~ /([A-Z]+-)/g ];
  }
}

say Dumper \%bighash;

__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62

*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

The output looks like this:

$VAR1 = {
          'Urinary Bladder' => [
                                 'CYST-',
                                 'VESIC-'
                               ]
        };
阳光下的泡沫是彩色的 2024-12-23 00:16:42

尝试以下操作。检查更改(或听 Aki 的说法)可能是个好主意:

use strict;
use warnings;

use Data::Dumper;

my %bighash;
my $current_key;

while ( <DATA> ) {

    chomp;

    if ( m/^MH = (.+)/ ) {
        $current_key = $1;

    } elsif ( /^CX = (.+)/ ) {
        my $text = $1;
        $bighash{ $current_key } = [ $text =~ /([A-Z]+-)/g ];

    }
}

print Dumper ( \%bighash );

__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62

*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

更新:使用 Regex-Captures 而不是 splitgrep

Try the following. And it's probably a good idea to examine the changes (or listen to Aki):

use strict;
use warnings;

use Data::Dumper;

my %bighash;
my $current_key;

while ( <DATA> ) {

    chomp;

    if ( m/^MH = (.+)/ ) {
        $current_key = $1;

    } elsif ( /^CX = (.+)/ ) {
        my $text = $1;
        $bighash{ $current_key } = [ $text =~ /([A-Z]+-)/g ];

    }
}

print Dumper ( \%bighash );

__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62

*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

Update: Used Regex-Captures instead of split and grep

我做我的改变 2024-12-23 00:16:42

最近没有练习我的 Perl 功夫,但最后的 else 语句看起来很可疑。

尝试删除最后一个 else 语句并在第二个 elsif 之后直接添加“push”语句。基本上匹配CX后就直接进行push操作。

另外,您知道 MH 必须始终出现在 CX 之前,否则逻辑就会中断。

Haven't practiced my perl kung fu lately but the last else statement looks fishy.

Try dropping the last else statement and add the 'push' statement straight after the second elsif. Basically do the push operation straight after matching the CX.

Also, you know that MH must always appear before a CX otherwise the logic breaks.

凉风有信 2024-12-23 00:16:42
  • 修复正则表达式
    /^MH = (\w+/) 应为 /^MH (\w+)/。您可能需要使用 \s+\s* 而不是空格
  • if 块中删除推送
  • 删除 elseelsif
  • 中,使用键 $key 将 $cx 推入哈希
  • 列表项
  • use strict;use warnings; 添加到您的代码中

尝试这些,如果您有困难,我会帮助您编写代码

  • Fix the regular expressions
    /^MH = (\w+/) should be /^MH (\w+)/. You may want to use \s+ or \s* instead of space
  • Remove push from the if block
  • Remove else block
  • In the elsif block Push $cx into hash using the key $key
  • List item
  • Add use strict; and use warnings; to your code

Try these and if you have difficulty i will help you with the code

浅唱ヾ落雨殇 2024-12-23 00:16:42

使用 Config::TinyConfig::YAML 对文件进行初始传递,然后单独循环每个记录。不过,如果您的文件大约有千兆字节或更多,这可能会耗尽您所有的内存。

It might be simpler to use Config::Tiny or Config::YAML to do an initial pass over the file and then loop through each record individually. Although if your file is like a gigabyte or more this might suck up all your memory.

胡大本事 2024-12-23 00:16:42

这是我很快做的事情,我希望它能给你一个开始的想法:

use Data::Dumper;
# Set your record separator
{
  local $/="*NEWRECORD\n";

  while(<DATA>) {
    # Get rid of your separator
    chomp($_);
    print "Parsing record # $.\n";
    push @records, $_ if ( $_ );
  }
}


foreach (@records) {
  # Get your sub records
  @lines = split(/\n/,$_);
  my %h = ();
  my %result = ();
  # Create a hash from your sub records
  foreach (@lines) {
    ($k, $v) = split(/\s*=\s*/, $_);
    $h{$k} = $v;
  }
  # Parse the CX and strip the lower case comments
  $h{ 'CX' } =~ s/[a-z]//g;
  $h{ 'CX' } =~ s/^\s+//g;
  # Have the upper case values as an array ref in the result hash
  $result{ $h{ 'MH' } } = [ split( /\s+/, $h{ 'CX' } ) ] if ( $h{ 'CX' } );
  print Dumper( \%h );
  print "Result:\n";
  print Dumper( \%result );
}
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62

*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359

Here is something I quickly did, I hope it gives you an idea to start from:

use Data::Dumper;
# Set your record separator
{
  local $/="*NEWRECORD\n";

  while(<DATA>) {
    # Get rid of your separator
    chomp($_);
    print "Parsing record # $.\n";
    push @records, $_ if ( $_ );
  }
}


foreach (@records) {
  # Get your sub records
  @lines = split(/\n/,$_);
  my %h = ();
  my %result = ();
  # Create a hash from your sub records
  foreach (@lines) {
    ($k, $v) = split(/\s*=\s*/, $_);
    $h{$k} = $v;
  }
  # Parse the CX and strip the lower case comments
  $h{ 'CX' } =~ s/[a-z]//g;
  $h{ 'CX' } =~ s/^\s+//g;
  # Have the upper case values as an array ref in the result hash
  $result{ $h{ 'MH' } } = [ split( /\s+/, $h{ 'CX' } ) ] if ( $h{ 'CX' } );
  print Dumper( \%h );
  print "Result:\n";
  print Dumper( \%result );
}
__DATA__
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA 
MED = *62

*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文