How to find the number and location of two substrings in two different files?

Posted 2024-10-15 11:39:23


Given two sequences, I need to compare them codon by codon (three bases at a time). If a change matches one in the list below, I need to report the changed codons and the location of the change, and count the number of occurrences.

For example:

sequence 1 - TTCAUUUCCCAU
sequence 2 - TTTAUAUCGCAC

The output I need is:

TTC->TTT considered/location-1/count-1
AUU->AUA considered/location-2/count-1
UCC->UCG considered/location-3/count-1

NOTE: CAU->CAC is not considered because it is not in the list below. The direction of the change also matters.

first sequence->second sequence
TTC->TTT
CTG->UUA
AUU->AUA
GUG->GUA
UCC->UCG
CCC->CCG
ACC->ACG
GCC->GCG
UAC->UAU
UGA->UAG
CAC->CAU
CAG->CAA
AAC->AAU
AAG->AAA
GAC->GAU
GAG->GAA
UGC->UGU
CGG->CGU
AGC->AGU
AGG->CGU
AGA->CGU
UAA->UAG
GGC->GGU

The code I have written so far is:

print "Enter the sequence:";
$a = <>;

print "Enter the mutated sequence:";
$b = <>;

chomp($a);
chomp($b);

my @codon = split(/(\w{3})/, $a);
my @codon1 = split(/(\w{3})/, $b);

open(OUT, ">output.txt") or die;
$count = 0;
@new = ();
@new1 = ();
for ($i = 0; $i <= $#codon; $i++) {
    for ($j = 0; $j <= $#codon1; $j++) {
        if ($codon[$i] = {TTC}) || ($codon1[$j] = {TTT}) {
            $count++;
        }
    }
}
print OUT " @new";
close OUT;


柠栀 2024-10-22 11:39:23

#!/usr/bin/env perl
use strict;

my %seq_map = (
    "TTC"=>"TTT",
    "CTG"=>"UUA",
    "AUU"=>"AUA",
    "GUG"=>"GUA",
    "UCC"=>"UCG",
    "CCC"=>"CCG",
    "ACC"=>"ACG",
    "GCC"=>"GCG",
    "UAC"=>"UAU",
    "UGA"=>"UAG",
    "CAC"=>"CAU",
    "CAG"=>"CAA",
    "AAC"=>"AAU",
    "AAG"=>"AAA",
    "GAC"=>"GAU",
    "GAG"=>"GAA",
    "UGC"=>"UGU",
    "CGG"=>"CGU",
    "AGC"=>"AGU",
    "AGG"=>"CGU",
    "AGA"=>"CGU",
    "UAA"=>"UAG",
    "GGC"=>"GGU"
);

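# Number of times each listed change has been seen so far,
# keyed by the codon from the first sequence.
my %seq_count = ();

my $seq1 = "TTCAUUUCCCAU";
my $seq2 = "TTTAUAUCGCAC";

# Walk both sequences one codon (3 characters) at a time.
my $max = int(length($seq1) / 3);
for (my $i = 0; $i < $max; $i++) {
    my $c1 = substr($seq1, $i*3, 3);
    my $c2 = substr($seq2, $i*3, 3);
    # Look up the listed replacement for the first-sequence codon,
    # so the direction of the change is respected.
    my $found = $seq_map{$c1};
    if ($found && ($found eq $c2)) {
        my $count = ++$seq_count{$c1};
        my $loc = $i+1;    # report 1-based codon positions
        print "${c1}->${c2} considered / location ${loc} / count ${count}\n";
    }
}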


半世蒼涼 2024-10-22 11:39:23

There are many ways to accomplish this, as is typical in Perl.

If the files are not large, you can read a file line by line into an array (or, if it already has one entry per line, just slurp the whole file into an array). Then use a while loop over the second file's handle to compare the codons at each position.

Because this is a bioinformatics problem and the files are typically large, I would instead read from each file handle line by line and compare as I go.

For the three-character split you are trying to do, I would use a for loop that runs from 0 to (length of the string divided by 3) - 1, building a regex on each pass to grab the first three letters, then the next three, and so on.

Something like /^\w{$count}(\w{3})/, where $count is the number of characters to skip (three times the loop index).
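As a rough sketch of the same three-character split, a single global match can collect all codons at once instead of building a new regex on each pass. The helper name split_codons is hypothetical, and the sketch assumes that trailing bases which do not fill a whole codon should be dropped.

```perl
use strict;
use warnings;

# Hypothetical helper: return the complete codons of a sequence, in order.
# Assumption: leftover bases that do not fill a codon are ignored.
sub split_codons {
    my ($seq) = @_;
    # In list context, a /g match returns every captured group.
    my @codons = $seq =~ /(\w{3})/g;
    return @codons;
}

my @codons = split_codons("TTCAUUUCCCAU");
print join(" ", @codons), "\n";    # prints "TTC AUU UCC CAU"
```

Unlike split(/(\w{3})/, $seq), which also returns the empty strings found between the matches, this returns only the codons themselves.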

The while loop could look something like this:

#!/usr/bin/perl
use strict;
use warnings;

open my $fh1, '<', "file1.txt" or die "Cannot open file1.txt: $!\n";
open my $fh2, '<', "file2.txt" or die "Cannot open file2.txt: $!\n";

my $count = 0;
while (my $lineF1 = <$fh1>) {
    my $lineF2 = <$fh2>;
    last unless defined $lineF2;    # file2.txt ran out of lines first
    chomp($lineF1, $lineF2);
    $count++;                       # 1-based line number

    # some changes may need to be made to this if statement
    if ($lineF1 eq $lineF2) {
        # do something important here
        print "$lineF1\n";
    } else {
        print "Line $count mismatch\n";
    }
}

close($fh1);
close($fh2);
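To connect the line-by-line loop with the codon comparison from the question, here is a minimal sketch. The helper name listed_changes and the report format are assumptions, the hash holds only a few entries of the question's list, and in-memory handles stand in for file1.txt and file2.txt.

```perl
use strict;
use warnings;

# Assumption: only a few entries of the question's list are shown here.
my %seq_map = (TTC => "TTT", AUU => "AUA", UCC => "UCG", CAC => "CAU");

# Hypothetical helper: read two handles in lockstep and return one report
# string per listed change, with 1-based line and codon positions.
sub listed_changes {
    my ($fh1, $fh2, $map) = @_;
    my @hits;
    my $line_no = 0;
    while (defined(my $l1 = <$fh1>)) {
        my $l2 = <$fh2>;
        last unless defined $l2;          # second input ended first
        chomp($l1, $l2);
        $line_no++;
        my @c1 = $l1 =~ /(\w{3})/g;
        my @c2 = $l2 =~ /(\w{3})/g;
        my $n = @c1 < @c2 ? @c1 : @c2;    # compare only codons present in both
        for my $i (0 .. $n - 1) {
            my $want = $map->{ $c1[$i] };
            next unless defined $want && $want eq $c2[$i];
            push @hits, sprintf("line %d codon %d: %s->%s",
                                $line_no, $i + 1, $c1[$i], $c2[$i]);
        }
    }
    return @hits;
}

# In-memory handles stand in for file1.txt and file2.txt.
my $text1 = "TTCAUUUCCCAU\n";
my $text2 = "TTTAUAUCGCAC\n";
open my $in1, '<', \$text1 or die "Cannot open first input: $!";
open my $in2, '<', \$text2 or die "Cannot open second input: $!";
print "$_\n" for listed_changes($in1, $in2, \%seq_map);
```

With the example sequences this reports TTC->TTT, AUU->AUA and UCC->UCG at codons 1 to 3 of line 1, and skips CAU->CAC because only the reverse direction is listed.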


久夏青 2024-10-22 11:39:23

Can you assume that the codons in the two files are "aligned"? If so, the problem is simple: load the list of valid transitions into a two-level hash:

 # of course, you load this from a file...
 $transitions{TTC}{TTT} = 1;
 $transitions{CTG}{UUA} = 1;
 ...
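
Loading that hash from a file could look roughly like the sketch below. It assumes the list is stored one change per line in the "TTC->TTT" form used in the question; the helper name load_transitions is hypothetical, and an in-memory handle stands in for the real list file.

```perl
use strict;
use warnings;

# Hypothetical helper: build the two-level hash from lines such as "TTC->TTT".
# Assumption: one change per line; lines in any other shape are skipped.
sub load_transitions {
    my ($fh) = @_;
    my %transitions;
    while (my $line = <$fh>) {
        next unless $line =~ /^\s*(\w{3})\s*->\s*(\w{3})\s*$/;
        $transitions{$1}{$2} = 1;
    }
    return \%transitions;
}

# An in-memory handle stands in for the list file.
my $list = "first sequence->second sequence\nTTC->TTT\nCTG->UUA\nAGG->CGU\n";
open my $fh, '<', \$list or die "Cannot open list: $!";
my $transitions = load_transitions($fh);
print join(",", sort keys %$transitions), "\n";    # prints "AGG,CTG,TTC"
```

The header line is skipped because it does not have the shape of two three-letter codons around "->".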

Then read both files line by line (or is each file just a single string?):

# of course, I'm leaving out all the file manipulation...
chomp(my $line1 = <FILE1>);
chomp(my $line2 = <FILE2>);

my $maxlen1 = length($line1);
my $maxlen2 = length($line2);
my $i = 0;

# stop as soon as either line has no complete codon left
while ($i + 3 <= $maxlen1 && $i + 3 <= $maxlen2) {
  my $codon1 = substr($line1, $i, 3);    # third argument is a length, not an end offset
  if (exists($transitions{$codon1})) {
    my $codon2 = substr($line2, $i, 3);
    if (exists($transitions{$codon1}{$codon2})) {
      print "we have a match $codon1 -> $codon2 at index $i\n";
    }
  }
  $i += 3;
}

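NOTE: use exists() rather than defined(); it only tests whether the key is present, which is all that is needed here. The outer exists() check also guards against autovivification: testing $transitions{$codon1}{$codon2} directly would create an empty $transitions{$codon1} entry for every codon that is not in the list. If you don't want the nested if(), compute $codon1 and $codon2 first and combine both tests: if (exists($transitions{$codon1}) && exists($transitions{$codon1}{$codon2})) {}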
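
The autovivification point can be seen in a small sketch. The helper name is_listed is hypothetical, and the hash holds a single entry of the list for brevity.

```perl
use strict;
use warnings;

my %transitions = (TTC => { TTT => 1 });

# Testing the second level directly creates an empty first-level entry.
my $direct   = (exists $transitions{CAU}{CAC}) ? 1 : 0;
my $vivified = (exists $transitions{CAU})      ? 1 : 0;
print "direct=$direct vivified=$vivified\n";       # prints "direct=0 vivified=1"

# Hypothetical helper: check the first level before touching the second.
sub is_listed {
    my ($map, $from, $to) = @_;
    return (exists $map->{$from} && exists $map->{$from}{$to}) ? 1 : 0;
}

my $guarded   = is_listed(\%transitions, "GGG", "GGA");
my $untouched = (exists $transitions{GGG}) ? 0 : 1;
print "guarded=$guarded untouched=$untouched\n";   # prints "guarded=0 untouched=1"
```

Because && short-circuits, the second exists is never evaluated when the first-level key is missing, so the hash is left unchanged.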
