Help parsing a file with Perl code

Posted 2024-11-18 07:23:32

I am new to Perl and have a question about the syntax. I received this code for parsing a file containing specific information. I was wondering what the if (/DID/) part of the subroutine get_number is doing? Is this leveraging regular expressions? I'm not quite sure because regular-expression matches look like $_ =~ /some expression/. Finally, is the while loop in the get_number subroutine necessary?

#!/usr/bin/env perl

use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;

# store the name of all the OCR file names in an array
my @file_list=qw{
   blah.txt
};

# set the scalar index to zero
my $file_index=0;

# open the file titled 'outputfile.txt' and write to it
# (or indicate that the file can't be opened)
open(OUT_FILE, '>', 'outputfile.txt')
    or die "Can't open output file\n";

while($file_index < 1){
    # open the OCR file and store it in the filehandle IN_FILE
    open(IN_FILE, '<', "$file_list[$file_index]")
        or die "Can't read source file!\n";

    print "Processing file $file_list[$file_index]\n";
    while(<IN_FILE>){
            my $citing_pat=get_number();
            get_country($citing_pat);
    }
    $file_index=$file_index+1;
}
close IN_FILE;
close OUT_FILE;

The definition of get_number is below.

sub get_number {
    while(<IN_FILE>){
        if(/DID/){
            my @fields=split / /;
            chomp($fields[3]);
            if($fields[3] !~ /\D/){
                return $fields[3];
            }
        }
    }
}

Comments (2)

迷鸟归林 2024-11-25 07:23:32

Perl has a variable $_ that is sort of the default dumping ground for a lot of things.

In get_number, while(<IN_FILE>){ is reading a line into $_, and the next line is checking if $_ matches the regular expression DID.

It's also common to see chomp; which also operates on $_ when no argument is given.
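To make the implicit use of $_ concrete, here is a minimal sketch. The sample data and the in-memory filehandle are assumptions made for illustration; they are not taken from the asker's input file.

```perl
use strict;
use warnings;

# Hypothetical sample data, read through an in-memory filehandle
# so the example does not depend on a real file.
my $data = "ID DID 7 12345\nnothing here\n";
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!";

my @matched;
while (<$fh>) {        # implicit: while (defined($_ = <$fh>))
    chomp;             # same as chomp($_)
    if (/DID/) {       # same as if ($_ =~ /DID/)
        push @matched, $_;
    }
}
close $fh;

print "$_\n" for @matched;
```

Writing the explicit forms shown in the comments behaves the same way; the short forms simply omit $_.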

幽梦紫曦~ 2024-11-25 07:23:32

In that case, if (/DID/) by default searches the $_ variable, so it is correct. However, it is a rather loose regex, IMO.
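As an illustration of why /DID/ is loose: it matches the three letters anywhere in the line, including inside a longer word. The sample strings and the word-boundary pattern below are assumptions, since the real input format is not shown; treat this as a sketch of one possible tightening rather than the right pattern for your data.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the second one contains "DID" only
# as part of the longer word "CANDIDATE".
my @samples = ('DID 1 2 12345', 'CANDIDATE 1 2 12345');

my @loose = grep { /DID/ } @samples;       # substring match
my @tight = grep { /\bDID\b/ } @samples;   # "DID" as a whole word only

print "loose: ", scalar(@loose), ", tight: ", scalar(@tight), "\n";
```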

The while loop in the sub may be necessary; it depends on what your input looks like. You should be aware that the two while loops will cause some lines to be skipped completely.

The while loop in the main program reads one line and does nothing with it. This means that the first line in the file, and every line directly following a matching line (i.e. a line that contains "DID" and whose 4th field is a number), is discarded.
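The skipping can be reproduced with a small sketch that keeps the same two-loop structure. The input here is hypothetical (five lines that would all qualify), and the filehandle is passed as an argument only so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical input: every line is a valid "DID" record (101 to 105).
my $data = join '', map { "x DID y $_\n" } 101 .. 105;
open(my $fh, '<', \$data) or die "Can't open in-memory file: $!";

# Same logic as the original get_number, reading from the handle passed in.
sub get_number {
    my ($in) = @_;
    while (<$in>) {
        next unless /DID/;
        my @fields = split / /;
        chomp $fields[3];
        return $fields[3] if $fields[3] !~ /\D/;
    }
    return;
}

my @found;
while (<$fh>) {                      # outer read: this line is never inspected
    my $number = get_number($fh);    # inner read: starts at the next line
    push @found, $number if defined $number;
}
close $fh;

print "@found\n";    # prints "102 104"
```

Lines 101, 103 and 105 are consumed by the outer loop and never examined, even though each of them would have matched.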

In order to answer that question properly, we'd need to see the input file.

There are a number of issues with this code, and if it works as intended, it's probably due to a healthy amount of luck.

Below is a cleaned up version of the code. I kept the modules in, since I do not know if they are used elsewhere. I also kept the output file, since it might be used somewhere you have not shown. This code will not attempt to use undefined values for get_country, and will simply do nothing if it does not find a suitable number.

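use warnings;
use strict;
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;

my @file_list = qw{ blah.txt };

open(my $out_file, '>', 'outputfile.txt') or die "Can't open output file: $!";

for my $file (@file_list) {
    open(my $in_file, '<', $file) or die "Can't read source file: $!";
    print "Processing file $file\n";
    # get_number returns undef at end of file, which ends the loop;
    # testing with defined() keeps a legitimate value of 0 from ending it early
    while (defined(my $citing_pat = get_number($in_file))) {
        get_country($citing_pat);
    }
    close $in_file;
}
close $out_file;

# Return the 4th whitespace-separated field of the next line that contains
# "DID" and whose 4th field is all digits, or undef at end of file.
sub get_number {
    my $fh = shift;
    while (<$fh>) {
        if (/DID/) {
            my $field = (split)[3];
            # the field may be missing on short lines, so check defined first
            if (defined $field && $field =~ /^\d+$/) {
                return $field;
            }
        }
    }
    return undef;
}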