Searching for substring matches from a hash in Perl

Posted on 2024-10-21 19:56:38

I have a file containing the substrings that I need to match in a given string. The given strings come from another file holding the actual data; each is one column of a CSV file. If a given string contains any of these substrings, it should be marked as TRUE. What is the best way to do this in Perl?

What I've done so far is something like this. There still seem to be some issues:

#!/usr/bin/perl

use warnings;
use strict;

if ($#ARGV+1 != 1) {
    print "usage: $0 inputfilename\n";
    exit;
}

our $inputfile = $ARGV[0];
our $outputfile = "$inputfile" . '.ads';
our $ad_file = "C:/test/easylist.txt";
our %ads_list_hash = ();

our $lines = 0;

# Create a list of substrings in the easylist.txt file
open ADS, "$ad_file" or die "can't open $ad_file";
while(<ADS>) {
    chomp;
    $ads_list_hash{$lines} = $_;
    $lines ++;
}

for(my $count = 0; $count < $lines; $count++) {
    print "$ads_list_hash{$count}\n";
}
open IN,"$inputfile" or die "can't open $inputfile";
while(<IN>) {
    chomp;
    my @hhfile = split /,/;
    for(my $count = 0; $count < $lines; $count++) {
        print "$hhfile[10]\t$ads_list_hash{$count}\n";

        if($hhfile[9] =~ /$ads_list_hash{$count}/) {
            print "TRUE !\n";
            last;
        }
    }
}

close IN;

赠我空喜 2024-10-28 19:56:38

See Text::CSV (a comma-separated values manipulator), for example:

use 5.010;
use Text::CSV;
use Data::Dumper;
my @rows;
my %match;
my @substrings = qw/Hello Stack overflow/;
my $csv = Text::CSV->new ( { binary => 1 } )  # should set binary attribute.
                 or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
        if($row->[0] ~~ @substrings){ # 1st field 
            say "match " ;
            $match{$row->[0]} = 1;
        }
 }
$csv->eof or $csv->error_diag();
close $fh;
print Dumper(\%match);
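
One point worth keeping in mind: when its right-hand side is an array, ~~ tests whether the field is equal to one of the elements, and on Perl 5.18 and later it also triggers an "experimental" warning. If the goal is to find fields that merely contain one of the substrings, a core-only check along the following lines may be closer to what the question asks. This is a minimal sketch, not the answer author's method; the helper name contains_any and the sample values are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical substrings, for illustration only.
my @substrings = qw/Hello Stack overflow/;

# Returns a true value if $field contains at least one of the needles.
# index() does a literal search, so regex metacharacters need no escaping.
sub contains_any {
    my ($field, @needles) = @_;
    return scalar grep { index($field, $_) >= 0 } @needles;
}

# Hypothetical field values standing in for one CSV column.
for my $field ('Hello', 'Hello world', 'buffer overflow', 'nothing') {
    my $found = contains_any($field, @substrings) ? 'match' : 'no match';
    print "$field: $found\n";
}
```

Inside the getline loop, the condition would then read contains_any($row->[0], @substrings) in place of the ~~ test.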

岛徒 2024-10-28 19:56:38

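You can use selectcol_arrayref, or fetchrow_* and a loop, to get an array of the words to search for. Then build a regex pattern by joining that array with '\b)|(?:\b' and surrounding the result with '(?:\b' and '\b)' (or something better suited to your needs).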
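
The pattern construction described above can be sketched roughly as follows. The word list and the helper name build_pattern are hypothetical, and the call to quotemeta is an addition that the answer does not mention; it is there on the assumption that the words should match literally rather than as regexes.

```perl
use strict;
use warnings;

# Hypothetical word list; in practice it would come from a file or a query.
my @words = ('ad', 'banner', 'track.js');

# Build one alternation with word boundaries around each alternative.
# quotemeta escapes regex metacharacters so each word matches literally.
sub build_pattern {
    my @list = @_;
    my $alternation = join '|', map { '(?:\b' . quotemeta($_) . '\b)' } @list;
    return qr/$alternation/;
}

my $pattern = build_pattern(@words);

# Hypothetical sample strings to test against the combined pattern.
for my $text ('show banner here', 'nothing to see', 'load track.js now') {
    printf "%-20s %s\n", $text, ($text =~ $pattern ? 'TRUE' : 'FALSE');
}
```

Because of the \b anchors, "ad" does not match inside "load"; dropping them would turn this into a plain substring search.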

笑忘罢 2024-10-28 19:56:38

Here's some cleaned-up code which does the same thing as the code you posted, except that it does not print $hhfile[10] along with each ad pattern before testing it; if you need that output, you will have to loop over all the patterns and test each one individually, in basically the same way you were already doing. (Even in that case, it would be better if your loops were for my $count (0 .. $lines - 1) instead of the C-style for (...;...;...).)

Instead of testing each pattern individually, I've used Regexp::Assemble, which builds a single pattern equivalent to testing all of the individual substrings at once. The smart match operator (~~) in Nikhil Jain's answer also checks a value against the whole list in one expression, although it compares the field with each list element for equality rather than searching inside it, and it requires Perl 5.10 or later, while Regexp::Assemble will still work for you if you're on 5.8 or (heaven forbid!) 5.6.

#!/usr/bin/env perl

use warnings;
use strict;

use Regexp::Assemble;

die "usage: $0 inputfilename\n" unless @ARGV == 1;

my $inputfile     = $ARGV[0];
my $outputfile    = $inputfile . '.ads';
my $ad_file       = "C:/test/easylist.txt";
my @ad_list;

# Create a list of substrings in the easylist.txt file
open my $ads_fh, '<', $ad_file or die "can't open $ad_file: $!";
while (<$ads_fh>) {
    chomp;
    push @ad_list, $_;
}

for (@ad_list) {
    print "$_\n";       # Or just "print;" - the $_ will be assumed
}      

my $ra = Regexp::Assemble->new;
$ra->add(@ad_list);

open my $in_fh, '<', $inputfile or die "can't open $inputfile: $!";
while (<$in_fh>) {
    my @hhfile = split /,/;
    print "TRUE !\n" if $ra->match($hhfile[9]);
}

(Code is syntactically valid, according to perl -c, but has not been tested beyond that.)
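
If the per-pattern output is needed after all, the individual loop mentioned at the start of this answer might look something like the sketch below. The helper name first_match and the sample data are hypothetical, and wrapping each pattern in \Q...\E is an assumption that the list entries are meant as literal substrings rather than regexes.

```perl
use strict;
use warnings;

# Hypothetical patterns, for illustration only.
my @ad_list = ('doubleclick', 'ads.example', 'banner');

# Return the first pattern found in $field, or undef if none is found.
# $#patterns is the last valid index, so the range never runs past the end.
sub first_match {
    my ($field, @patterns) = @_;
    for my $i (0 .. $#patterns) {
        my $p = $patterns[$i];
        return $p if $field =~ /\Q$p\E/;   # \Q...\E: match $p literally
    }
    return;
}

# Hypothetical field value standing in for $hhfile[9].
my $field = 'http://ads.example/banner.gif';
my $hit   = first_match($field, @ad_list);
print defined $hit ? "TRUE ! ($hit)\n" : "FALSE\n";
```

Returning the pattern itself, rather than a plain true value, makes it possible to report which entry triggered the match.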
