Searching for substring matches from a hash in Perl
I have a file containing substrings that I need to match against given strings. The given strings come from another file that holds the actual data; they are one column of a CSV file. If a given string contains any of these substrings, it should be marked as TRUE. What is the best way to do this in Perl?
What I've done so far is something like this. There still seem to be some issues:
#!/usr/bin/perl
use warnings;
use strict;

if ($#ARGV+1 != 1) {
    print "usage: $0 inputfilename\n";
    exit;
}

our $inputfile = $ARGV[0];
our $outputfile = "$inputfile" . '.ads';
our $ad_file = "C:/test/easylist.txt";
our %ads_list_hash = ();
our $lines = 0;

# Create a list of substrings in the easylist.txt file
open ADS, "$ad_file" or die "can't open $ad_file";
while(<ADS>) {
    chomp;
    $ads_list_hash{$lines} = $_;
    $lines ++;
}

for(my $count = 0; $count < $lines; $count++) {
    print "$ads_list_hash{$count}\n";
}

open IN,"$inputfile" or die "can't open $inputfile";
while(<IN>) {
    chomp;
    my @hhfile = split /,/;
    for(my $count = 0; $count < $lines; $count++) {
        print "$hhfile[10]\t$ads_list_hash{$count}\n";
        if($hhfile[9] =~ /$ads_list_hash{$count}/) {
            print "TRUE !\n";
            last;
        }
    }
}
close IN;
Comments (3)
See Text::CSV, a comma-separated values manipulator.
You can use selectcol_arrayref, or fetchrow_* and a loop, to get an array of the words to search for. Then build the regex pattern by joining that array with '\b)|(?:\b' and wrapping the result in '(?:\b' and '\b)' (or something better suited to your needs).
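A minimal sketch of that pattern-building step, assuming the words are already in an array. The hard-coded @words list is hypothetical and stands in for whatever selectcol_arrayref or a fetchrow_* loop would return, and the quotemeta call is an added assumption so that words containing regex metacharacters are matched literally:

```perl
use strict;
use warnings;

# Hypothetical word list; in practice this would come from
# selectcol_arrayref or a fetchrow_* loop.
my @words = ('banner', 'popup', 'ad.click');

# Escape metacharacters (assumption: the words are literal text),
# then join with '\b)|(?:\b' and wrap in '(?:\b' ... '\b)'.
my $pattern = '(?:\b' . join('\b)|(?:\b', map { quotemeta($_) } @words) . '\b)';
my $regex   = qr/$pattern/;

# Report whether each sample string contains one of the words.
for my $text ('a popup window', 'popups everywhere', 'see ad.click now', 'adXclick') {
    printf "%-20s %s\n", $text, ($text =~ $regex ? 'TRUE' : 'FALSE');
}
```

Because of the \b anchors, this matches whole words only. If plain substring matching is what you want, as in the question, leave the \b parts out.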
Here's some cleaned-up code which will do the same thing as the code you posted, except that it does not print $hhfile[10] along with each ad pattern before testing them; if you need that output, you will have to loop over all the patterns and test each one individually, in basically the same way you were already doing. (Even in that case, it would be better if your loops were for my $count (0 .. $lines - 1) instead of the C-style for (...;...;...).)

Instead of testing each pattern individually, I've used Regexp::Assemble, which builds a single pattern equivalent to testing all of the individual substrings at once. The smart match operator (~~) in Nikhil Jain's answer will do basically the same thing when used as shown in his answer, but it requires Perl 5.10 or later, while Regexp::Assemble will still work for you if you're on 5.8 or (heaven forbid!) 5.6.

(Code is syntactically valid, according to perl -c, but has not been tested beyond that.)
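A minimal sketch of the single-pattern idea, using only core Perl. It swaps Regexp::Assemble for a plain alternation built with join and quotemeta, so it is not the code this answer refers to. The sample patterns, the sample CSV lines, and the is_ad helper are all hypothetical; the field index 9 is taken from the question:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for easylist.txt and the CSV input.
my @ad_patterns = ('doubleclick.net', '/ads/', 'banner?id=');
my @csv_lines   = (
    'a,b,c,d,e,f,g,h,i,http://doubleclick.net/x,k',
    'a,b,c,d,e,f,g,h,i,http://example.com/page,k',
    'a,b,c,d,e,f,g,h,i,http://example.com/ads/1.gif,k',
);

# One alternation of all substrings; quotemeta keeps '.', '?' and '/' literal.
my $alternation = join '|', map { quotemeta($_) } @ad_patterns;
my $ads_re      = qr/$alternation/;

# Return 1 if the tenth field (index 9) of a CSV line contains any ad substring.
sub is_ad {
    my ($line) = @_;
    my @fields = split /,/, $line;
    return 0 unless defined $fields[9];
    return $fields[9] =~ $ads_re ? 1 : 0;
}

for my $line (@csv_lines) {
    print is_ad($line) ? "TRUE\n" : "FALSE\n";
}
```

The pattern is compiled once, so each input line costs a single match instead of one match per substring. Splitting on commas is kept from the question and will break on quoted fields that contain commas; a CSV parser such as Text::CSV avoids that.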