Parsing a vertically delimited file in Perl
I have a file that looks like this:
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
AQ = AB AH BS CH CY DE EM EN GD IM IN IR ME MI PA PH PP PS RA RE RI SE SU TR UL US VI
CX = consider also terms at CYST- and VESIC-
MED = *1359
Each record has a different number of lines (for example, the CX entry is not always present). If CX is present, it appears only once per record.
We want a hash that takes the MH values as keys and the CX values as values.
Parsing the data above, we hope to get this structure:
$VAR = { "Urinary Bladder" => ["CYST-", "VESIC-"] };
What's the right way to parse it?
I'm stuck with the following code, which doesn't give me the result I want.
use Data::Dumper;
my %bighash;
my $key = "";
my $cx = "";
while (<>) {
    chomp;
    if (/^MH = (\w+/)) {
        $key = $1;
        push @{$bighash{$key}}, " ";
    }
    elsif ( /^CX = (\w+/)) {
        $cx = $1;
    }
    else {
        push @{$bighash{$key}}, $cx;
    }
}
Comments (6)
This becomes simpler if you use $/ to read the data a paragraph at a time. I'm surprised that no one else has suggested that.
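The answer's own code is not preserved here, so the following is only a minimal sketch of the idea, not the original solution. It assumes that every record begins with a `*NEWRECORD` line (so that line can be used as the value of `$/`) and that the wanted CX terms are the words ending in a hyphen. The helper name `parse_records` and the in-memory sample are made up for illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: returns { MH value => [ CX terms ] }.
sub parse_records {
    my ($fh) = @_;
    my %cx_for;

    # Assumption: every record begins with a "*NEWRECORD" line, so that
    # line can serve as the input record separator.
    local $/ = "*NEWRECORD\n";

    while ( my $record = <$fh> ) {
        chomp $record;    # removes the trailing "*NEWRECORD\n", if any

        # /m lets ^ and $ match at each line inside the record.
        my ($mh) = $record =~ /^MH\s*=\s*(.+?)\s*$/m or next;
        my ($cx) = $record =~ /^CX\s*=\s*(.+?)\s*$/m or next;

        # Assumption: the wanted terms are the words ending in a hyphen.
        $cx_for{$mh} = [ $cx =~ /(\S+-)(?!\S)/g ];
    }
    return \%cx_for;
}

my $sample = <<'END';
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
CX = consider also terms at CYST- and VESIC-
MED = *1359
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print Dumper( parse_records($fh) );
```

Because a whole record is in hand at once, the MH and CX lines can be matched in any order, and records without a CX line are simply skipped.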
Try the following. It's probably a good idea to examine the changes (or listen to Aki).
Update: Used regex captures instead of split and grep.
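The code this answer refers to is missing from the page. As a rough illustration of the difference mentioned in the update, the sketch below extracts the terms from a CX value with regex captures and shows a split-and-grep form in a comment for comparison. The helper name `cx_terms` is hypothetical, and treating "uppercase word ending in a hyphen" as the definition of a term is an assumption based on the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the hyphen-terminated terms out of a CX value.
sub cx_terms {
    my ($value) = @_;

    # Regex captures: in list context, a /g match returns every capture.
    # The lookarounds require whitespace (or the string edge) on both sides.
    my @terms = $value =~ /(?<!\S)([A-Z]+-)(?!\S)/g;

    # Split-and-grep form of the same filter, shown for comparison:
    # my @terms = grep { /^[A-Z]+-$/ } split ' ', $value;

    return @terms;
}

my @terms = cx_terms('consider also terms at CYST- and VESIC-');
print join( ', ', @terms ), "\n";    # prints "CYST-, VESIC-"
```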
I haven't practiced my Perl kung fu lately, but the last else statement looks fishy.
Try dropping the last else statement and adding the push statement straight after the second elsif. Basically, do the push operation straight after matching the CX.
Also, note that MH must always appear before CX within a record; otherwise the logic breaks.
/^MH = (\w+/) should be /^MH = (\w+)/. You may want to use \s+ or \s* instead of a literal space.
Remove the push from the if block.
Remove the else block.
In the elsif block, push $cx into the hash using the key $key.
Add use strict; and use warnings; to your code.
Try these, and if you have difficulty I will help you with the code.
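The changes listed above could be applied to the original loop roughly as follows. This is a sketch under a few assumptions, not a definitive fix: it captures with `(.+)` rather than `(\w+)` because `\w+` would stop at the space in "Urinary Bladder", it keeps only the words ending in a hyphen to match the desired output, and it wraps the loop in a hypothetical `parse_lines` helper so it can be run against an in-memory sample.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical wrapper around the original line-by-line loop.
sub parse_lines {
    my ($fh) = @_;
    my %bighash;
    my $key;

    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^MH\s*=\s*(.+)/ ) {
            $key = $1;    # remember the heading; nothing is pushed here
        }
        elsif ( $line =~ /^CX\s*=\s*(.+)/ and defined $key ) {
            my $cx = $1;
            # Push straight after matching CX; there is no else block.
            # Assumption: only the words ending in a hyphen are wanted.
            push @{ $bighash{$key} }, grep { /-$/ } split ' ', $cx;
        }
        elsif ( $line =~ /^\*NEWRECORD/ ) {
            undef $key;    # new record: forget the previous MH
        }
    }
    return \%bighash;
}

my $sample = <<'END';
*NEWRECORD
RECTYPE = D
MH = Calcimycin
AQ = AA
MED = *62
*NEWRECORD
RECTYPE = D
MH = Urinary Bladder
CX = consider also terms at CYST- and VESIC-
MED = *1359
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print Dumper( parse_lines($fh) );
```

Resetting `$key` at each `*NEWRECORD` line means a CX line is ignored unless an MH line has already been seen in the same record, which covers the ordering caveat raised in another answer.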
It might be simpler to use Config::Tiny or Config::YAML to do an initial pass over the file and then loop through each record individually. However, if your file is a gigabyte or more, this might use up all your memory.
Here is something I put together quickly; I hope it gives you an idea to start from: