Perl code to list all the words following a given string in a text file

Posted 2024-09-15 08:23:19


This is difficult to describe, but it is useful for extracting data from the output I am dealing with (I hope to use this code for a large number of purposes).

Here is an example:
Say I have a text file with words and some special characters ($, #, !, etc) that reads:


blah blah
blah add this word to the list: 1234.56 blah blah
blah blah
blah now don't forget to add this word to the list: PINAPPLE blah blah
And for bonus points,
it would be nice to know that the script
would be able to add this word to the list: 1!@#$%^&*()[]{};:'",<.>/?asdf blah blah
blah blah


As the example implies, I would like to add whatever "word" (defined as any string that does not contain spaces in this context) to some form of list such that I can extract elements of the list as list[2] list[3] or list(4) list(5), or something along those lines.

This would be very versatile, and after some questioning in another thread and another forum, I am hoping that writing it in Perl will make it relatively fast in execution, so it will work well even for large text files.
I intend to use this to read data from output files generated by different programs, regardless of the structure of the output file; i.e., if I know the string to search for, I can get the data.


Comments (3)

有木有妳兜一样 2024-09-22 08:23:19


I think there are some missing words in your question :)
But this sounds like what you want (assuming even the "large text files" fit in memory; if not, you'd loop through line by line, pushing onto @list instead).

use File::Slurp;

my $filecontents = File::Slurp::read_file("filename");
my @list = $filecontents =~ /add this word to the list: (\S+)/g;
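For files too large to slurp, the fallback the answer mentions is a line-by-line loop. A minimal sketch of that variant (the file name `sample.txt` and its contents are placeholders; the script writes the sample file itself so it is self-contained):

```perl
use strict;
use warnings;

# Build a tiny sample file so the sketch is self-contained;
# 'sample.txt' and its contents are placeholders.
open my $out, '>', 'sample.txt' or die "can't write: $!";
print $out "blah add this word to the list: 1234.56 blah\n",
           "blah blah\n",
           "blah add this word to the list: PINAPPLE blah\n";
close $out;

# Line-by-line scan: constant memory, so it also works on files
# far too large to fit in RAM (assumes marker and word share a line).
my @list;
open my $fh, '<', 'sample.txt' or die "can't open: $!";
while (my $line = <$fh>) {
    # //g collects every match, in case a line contains
    # the marker more than once
    push @list, $line =~ /add this word to the list:\s*(\S+)/g;
}
close $fh;

print "$_\n" for @list;   # prints 1234.56, then PINAPPLE
```

Unlike the slurp version, memory use here stays bounded by the longest line rather than the file size.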
烟雨扶苏 2024-09-22 08:23:19


If the string for the searches is the same, let Perl do the processing by using the search phrase as input record separator:

open my $fh, '<', 'test.dat' or die "can't open $!"; # usual way of opening a file

my @list;                                            # declare empty array 'list' (results)
$/= 'add this word to the list:';                    # define custom input record separator

while( <$fh> ) {                                     # read records one by one
   push @list, $1 if /(\S\S*)/
}
close $fh;                                           # that's it, close the file!

print join "\n", @list;                              # this will list the results

The above is "almost OK": it will also save the first word of the file in $list[0], because
of the way the records are split. But this way is very easy to comprehend (IMHO):

blah                 <== first word of the file
1234.56
PINAPPLE
1!@#$%^&*()[]{};:'",<.>/?asdf

Q: Why not simply look the strings up with one regex over the entire data (as has already been suggested here)? Because in my experience, record-wise processing with a per-record regular expression (probably a very complicated regex in a real use case) will be faster, especially on very large files. That's the reason.


Real world test

To back this claim up, I performed some tests with a 200MB data file containing 10,000 of
your markers. The test source follows:

use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
use File::Slurp;
# 'data.dat', a 200MB data file, containing 10_000
# markers ('add this word to the list:') with
# one of several different data items after each.

my $t = timethese(10,
 {
  'readline+regex' => sub { # trivial reading line-by-line
                     open my $fh, '<', 'data.dat' or die "can't open $!"; 
                     my @list;                                            
                     while(<$fh>) { 
                        push @list,$1 if /add this word to the list:\s*(\S+)/
                     }
                     close $fh;                                           
                     return scalar @list;   
                  },
  'readIRS+regex' => sub { # treat each 'marker' as start of an input record
                     open my $fh, '<', 'data.dat' or die "can't open $!"; 
                     $/= 'add this word to the list:';    # new IRS                
                     my @list;                                            
                     while(<$fh>) { push @list, $1 if /(\S+)/ }       
                     close $fh;                                           
                     return scalar @list;   
                  },
  'slurp+regex' => sub { # read the whole file and apply regular expression
                     my $filecontents = File::Slurp::read_file('data.dat');
                     my @list = $filecontents =~ /add this word to the list:\s*(\S+)/g;
                     return scalar @list;
                  },
 }
);
cmpthese( $t ) ;

which outputs the following timing results:

Benchmark: timing 10 iterations of readIRS+regex, readline+regex, slurp+regex...
readIRS+regex: 43 wallclock secs (37.11 usr +  5.48 sys = 42.59 CPU) @  0.23/s (n=10)
readline+regex: 42 wallclock secs (36.47 usr +  5.49 sys = 41.96 CPU) @  0.24/s (n=10)
slurp+regex: 142 wallclock secs (135.85 usr +  4.98 sys = 140.82 CPU) @  0.07/s (n=10)
               s/iter    slurp+regex  readIRS+regex readline+regex
slurp+regex      14.1             --           -70%           -70%
readIRS+regex    4.26           231%             --            -1%
readline+regex   4.20           236%             1%             --

which basically means that the simple line-wise reading and the block-wise reading with a custom IRS
are roughly 3.3 times as fast (one pass in ~4 seconds) as slurping the file and scanning it with a
regular expression.

This basically says that if you are processing files of this size on a system like mine ;-),
you should read line-by-line if your search problem is located on a single line, and read
by a custom input record separator if your search problem spans more than one line (my $0.02).

Want to run the test yourself? This script:

use strict;
use warnings;

sub getsomerandomtext {
    my ($s, $n) = ('', (shift));
    while($n --> 0) {
        $s .= chr( rand(80) + 30 );
        $s .= "\n" if rand($n) < $n/10
    }
    $s x 10
}

my @stuff = (
 q{1234.56}, q{PINEAPPLE}, q{1!@#$%^&*()[]{};:'",<.>/?asdf}
);

my $fn = 'data.dat';
open my $fh, '>', $fn or die $!;

my $phrase='add this word to the list:';
my $x = 10000;

while($x --> 0) {
   print $fh
      getsomerandomtext(1000),  ' ',
      $phrase, ' ', $stuff[int(rand(@stuff))],  ' ',
      getsomerandomtext(1000), "\n",
}

close $fh;
print "done.\n";

creates the 200MB input file 'data.dat'.

Regards

rbo

苏大泽ㄣ 2024-09-22 08:23:19


How about:

my(@list);
my $rx = qr/.*add this word to the list: +(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1;
          s/$rx//;
     }
}

This allows for long lines containing more than one of the 'add' markers. If there can definitely be only one, replace the inner while with an if. (Except, of course, that I used a greedy '.*', which snaffles up everything to the last occurrence of the match...)

my(@list);
my $rx = qr/(?:.*?)add this word to the list: +(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1;
          s/$rx//;
     }
}

With a selectable marker:

my $marker = "add this word to the list:";
my(@list);
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1;
          s/$rx//;
     }
}

With no repeats:

my $marker = "add this word to the list:";
my(%hash);
my(@list);
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1 unless defined $hash{$1};
          $hash{$1} = 1;
          s/$rx//;
     }
}

Etc.


And, as @ysth points out, you (I) don't need the substitution; Perl correctly DWIMs a g-qualified match in the inner loop:

#!/usr/bin/perl -w
use strict;
my(@list);
my(%hash);
my($marker) = "add this word to the list:";
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
    while (m/$rx/g)
    {
        push @list, $1 unless defined $hash{$1};
        $hash{$1} = 1;
    }
}

foreach my $i (@list)
{
    print "$i\n";
}
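Whichever collection approach is used, the indexed access the question asks about (list[2], list[3], ...) is ordinary Perl array subscripting on @list. A minimal sketch, with the array seeded directly for illustration rather than filled by one of the loops above:

```perl
use strict;
use warnings;

# @list stands in for the words collected by any of the answers above;
# the values here mirror the question's sample data.
my @list = ('1234.56', 'PINAPPLE', q{1!@#$%^&*()[]{};:'",<.>/?asdf});

print "first:  $list[0]\n";            # Perl arrays are 0-based
print "second: $list[1]\n";            # prints PINAPPLE
print "count:  ", scalar(@list), "\n"; # prints 3
```

Note the sigil switch: the array is @list, but a single element is $list[1].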