String matching search

Posted on 2025-01-01 16:15:42

I have one text file like this as the query file:

fooLONGcite
GetmoreDATA
stringMATCH
GOODthing

And another text file like this as the subject file:

sometingfooLONGcite
anyotherfooLONGcite
matchGetmoreDATA
GETGOODthing
brotherGETDATA
CITEMORETHING
TOOLONGSTUFFETC

The expected result is to find every line in the subject file that contains one of the query strings and print it. So the output should be:

sometingfooLONGcite
anyotherfooLONGcite
matchGetmoreDATA
GETGOODthing

Here is my Perl script, but it doesn't work. Can you help me find where the problem is? Thanks.

#!/usr/bin/perl
use strict;

# to check the command line option
if($#ARGV<0){
    printf("Usage: \n <tag> <seq> <outfile>\n");
    exit 1;
}

# to open the given infile file
open(tag, $ARGV[0]) or die "Cannot open the file $ARGV[0]";
open(seq, $ARGV[1]) or die "Cannot open the file $ARGV[1]";

my %seqhash = ();
my $tag_id;
my $tag_seq;
my $seq_id;
my $seq_seq;
my $seq;
my $i = 0;

print "Processing cds seq\n";
#check the seq file
while(<seq>){ 
    my @line = split;
    if($i != 0){
        $seqhash{$seq_seq} = $seq;
        $seq = "";
        print "$seq_seq\n";
    }
    $seq_seq = $line[0];
    $i++;
}

while(<tag>){ 
    my @tagline = split; 
    $tag_seq = $tagline[0];
    $seq = $seqhash{$seq_seq};
    #print "$tag_seq\n";
    print "$seq\n";
    #print output ">$id\n$seq\n";
}
#print "Ending of Processing gff\n";

close(tag);
close(seq);

忆离笙 2025-01-08 16:15:42

As I understand it, you are looking for a partial (substring) match, not an exact one. Here is a script that does what I think you are looking for:

Content of script.pl. I assume the query file is small, because I put all of its content into a single regex:

use warnings;
use strict;

## Check arguments.
die qq[Usage: perl $0 <query_file> <subject_file>\n] unless @ARGV == 2;

## Open input files. Abort on error.
open my $fh_query, qq[<], shift @ARGV or die qq[Cannot open input file: $!\n];
open my $fh_subject, qq[<], shift @ARGV or die qq[Cannot open input file: $!\n];

## Variable to save a regex with alternations of the content of the 'query' file.
my $query_regex;

{
    ## Read content of the 'query' file in slurp mode.
    local $/ = undef;
    my $query_content = <$fh_query>;

    ## Remove trailing whitespace and join the lines into a regex alternation.
    $query_content =~ s/\s+\Z//;
    $query_content =~ s/\n/|/g;
    $query_regex = qr/(?i:($query_content))/;
}

## Read the 'subject' file and print each line that matches
## any line of the 'query' file. The /o modifier is not needed
## because $query_regex is already a compiled qr// object.
while ( <$fh_subject> ) {
    if ( m/$query_regex/ ) {
        print;
    }
}

Run the script:

perl script.pl query.txt subject.txt

And the result:

sometingfooLONGcite
anyotherfooLONGcite
matchGetmoreDATA
GETGOODthing
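
If the query strings can contain regex metacharacters (for example `.` or `+`), interpolating them unescaped into the pattern changes what it matches. The following is a minimal sketch of one way to guard against that, assuming the queries should be treated as literal text. The helper names `build_query_regex` and `matching_lines` are hypothetical, and the sample data is inlined instead of being read from files:

```perl
use strict;
use warnings;

# Hypothetical helper: build one case-insensitive alternation
# from query strings that are treated as literal text.
sub build_query_regex {
    my @queries = grep { length($_) } @_;   # skip empty lines
    return qr/(?!)/ unless @queries;        # pattern that never matches
    my $alternation = join '|', map { quotemeta($_) } @queries;
    return qr/(?:$alternation)/i;
}

# Hypothetical helper: return the subject lines that contain any query.
sub matching_lines {
    my ($regex, @subjects) = @_;
    return grep { $_ =~ $regex } @subjects;
}

# Sample data copied from the question.
my @queries  = qw(fooLONGcite GetmoreDATA stringMATCH GOODthing);
my @subjects = qw(sometingfooLONGcite anyotherfooLONGcite matchGetmoreDATA
                  GETGOODthing brotherGETDATA CITEMORETHING TOOLONGSTUFFETC);

print "$_\n" for matching_lines(build_query_regex(@queries), @subjects);
```

With the sample data this prints the same four lines as above. `quotemeta` is a Perl built-in that backslash-escapes non-word characters, so a query such as `a.b` only matches a literal dot.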

要走就滚别墨迹 2025-01-08 16:15:42

Your current code doesn't make a lot of sense; you're even referencing variables you don't assign anything to.

All you need to do is read the first file into a hash, then check each line of the second against that hash.

while (my $line = <FILE>)
{
    chomp($line);
    $hash{$line} = 1;
}

...

while (my $line = <FILE2>)
{
    chomp($line);
    if (defined $hash{$line})
    {
        print "$line\n";
    }
}
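
Note that a hash lookup like the one above only finds subject lines that are identical to a query line. In the sample data from the question, each query appears inside a longer subject line, so a substring test is needed instead. Here is a minimal sketch using the built-in `index`, assuming case-sensitive matching. The helper name `lines_containing` is hypothetical, and the data is inlined instead of being read from files:

```perl
use strict;
use warnings;

# Hypothetical helper: return every subject line that contains
# at least one of the query strings as a substring.
sub lines_containing {
    my ($queries, $subjects) = @_;
    my @hits;
    for my $line (@$subjects) {
        # index() returns -1 when the substring is absent;
        # grep in boolean context yields the number of queries found.
        if (grep { index($line, $_) >= 0 } @$queries) {
            push @hits, $line;
        }
    }
    return @hits;
}

# Sample data copied from the question.
my @queries  = qw(fooLONGcite GetmoreDATA stringMATCH GOODthing);
my @subjects = qw(sometingfooLONGcite anyotherfooLONGcite matchGetmoreDATA
                  GETGOODthing brotherGETDATA CITEMORETHING TOOLONGSTUFFETC);

print "$_\n" for lines_containing(\@queries, \@subjects);
```

This checks every query against every subject line, which is fine for small files. For a large query list, a single precompiled regex alternation may be faster.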
