How to find the count and location of two substrings from two different files?
Given two sequences, I need to compare them codon by codon (three bases at a time). If a change matches one in the following list, I have to record the location of the change and the codons involved, and count the number of occurrences.
For example:
sequence 1 - TTCAUUUCCCAU
sequence 2 - TTTAUAUCGCAC
The output which I need to get is
TTC->TTT considered/location-1/count-1
AUU->AUA considered/location-2/count-1
UCC->UCG considered/location-3/count-1
NOTE: CAU->CAC is not considered because it is not in the following list. The direction of each change must also be considered.
LIST (first sequence->second sequence):
TTC->TTT
CTG->UUA
AUU->AUA
GUG->GUA
UCC->UCG
CCC->CCG
ACC->ACG
GCC->GCG
UAC->UAU
UGA->UAG
CAC->CAU
CAG->CAA
AAC->AAU
AAG->AAA
GAC->GAU
GAG->GAA
UGC->UGU
CGG->CGU
AGC->AGU
AGG->CGU
AGA->CGU
UAA->UAG
GGC->GGU
The code I have written so far is:
print "Enter the sequence:";
$a = <>;
print "Enter the mutated sequence:";
$b = <>;
chomp($a);
chomp($b);
my @codon = split(/(\w{3})/, $a);
my @codon1 = split(/(\w{3})/, $b);
open(OUT, ">output.txt") or die;
$count = 0;
@new = ();
@new1 = ();
for ($i = 0; $i <= $#codon; $i++) {
    for ($j = 0; $j <= $#codon1; $j++) {
        if ($codon[$i] = {TTC}) || ($codon1[$j] = {TTT}) {
            $count++;
        }
    }
}
print OUT " @new";
close OUT;
There are many ways to accomplish this, as is typically the case in Perl.
If the files are not large, you can read each file line by line into an array (or, if it is already one entry per line, just slurp the whole file into an array). Then use a while loop (and the second file's file handle) to compare the codons position by position.
Because this is a bioinformatics problem, and the files are typically large, it would be wiser to read from each file handle line by line and do the comparisons as you go.
For the 3-character split you are trying to do, I would use a for loop running from 0 to the length of the string you are checking divided by 3, minus 1. Then build a regex on each pass to grab the first three letters, then the next three, and so on. Something like /^\w{$count}(\w{3})/, where $count is the number of characters to skip.
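As a minimal sketch of that for loop, assuming the sequence is a single string of word characters: the helper name codons_of is hypothetical, and the regex skips the characters already consumed with \w before capturing the next three.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sequence into complete codons by skipping
# $count characters and capturing the following three.
sub codons_of {
    my ($seq) = @_;
    my @codons;
    my $n = int(length($seq) / 3);    # number of complete codons
    for my $i (0 .. $n - 1) {
        my $count = $i * 3;           # characters to skip on this pass
        if ($seq =~ /^\w{$count}(\w{3})/) {
            push @codons, $1;
        }
    }
    return @codons;
}

my @codons = codons_of('TTCAUUUCCCAU');
print "@codons\n";    # TTC AUU UCC CAU
```

A trailing partial codon is dropped because the loop only runs over complete groups of three. If the regex feels heavy, substr($seq, $i * 3, 3) inside the same loop should give the same codons.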
The while loop could look something like this:
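A minimal sketch of such a loop, not the answerer's original code. It assumes both files hold the same number of lines, that each line's length is a multiple of three, and it uses in-memory file handles as stand-ins for the two real files so the example runs on its own.

```perl
use strict;
use warnings;

# Stand-ins for the two sequence files (assumption: one sequence per line).
my $file1 = "TTCAUUUCCCAU\n";
my $file2 = "TTTAUAUCGCAC\n";
open my $fh1, '<', \$file1 or die "Cannot open first sequence: $!";
open my $fh2, '<', \$file2 or die "Cannot open second sequence: $!";

my @changes;        # each entry: [location, codon from file 1, codon from file 2]
my $location = 0;   # 1-based codon position, counted across lines

# Read one line from each handle per pass; stop when either file runs out.
while (my $line1 = <$fh1>) {
    my $line2 = <$fh2>;
    last unless defined $line2;
    chomp($line1, $line2);

    # Only compare as far as the shorter of the two lines.
    my $len = length($line1) < length($line2) ? length($line1) : length($line2);
    for my $i (0 .. int($len / 3) - 1) {
        $location++;
        my $codon1 = substr($line1, $i * 3, 3);
        my $codon2 = substr($line2, $i * 3, 3);
        push @changes, [$location, $codon1, $codon2] if $codon1 ne $codon2;
    }
}
close $fh1;
close $fh2;

printf "%s->%s at location %d\n", @{$_}[1, 2, 0] for @changes;
```

This version records every differing codon pair, including CAU->CAC at location 4. Filtering against the allowed list is a separate step.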
Can you assume that the codons in the two files are "aligned"? If so, the problem is simple: load the list of valid transitions into a two-level hash:
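One way such a two-level hash might be built, as a sketch rather than the answerer's original code. The pairs are copied from the list in the question; the variable name %transitions matches the one used in the note further down, and @pairs is a hypothetical helper.

```perl
use strict;
use warnings;

# Allowed changes from the question, written as first sequence->second sequence.
my @pairs = qw(
    TTC->TTT CTG->UUA AUU->AUA GUG->GUA UCC->UCG CCC->CCG ACC->ACG GCC->GCG
    UAC->UAU UGA->UAG CAC->CAU CAG->CAA AAC->AAU AAG->AAA GAC->GAU GAG->GAA
    UGC->UGU CGG->CGU AGC->AGU AGG->CGU AGA->CGU UAA->UAG GGC->GGU
);

# First level: codon in the first sequence. Second level: codon it may become.
my %transitions;
for my $pair (@pairs) {
    my ($from, $to) = split /->/, $pair;
    $transitions{$from}{$to} = 1;
}

print scalar(keys %transitions), " source codons loaded\n";
```

Keying on the source codon first keeps the direction of the change: CAC->CAU is stored, but there is no entry under CAU, so CAU->CAC is not matched.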
Then read both files line by line (or are they just one string each?):
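A sketch of the comparison step, assuming each sequence is one aligned string. The sub name report_changes is hypothetical, and the output format is an assumption modelled on the example in the question, with one line per transition, all of its locations joined by commas, and its total count.

```perl
use strict;
use warnings;

# Two-level hash of allowed changes (first sequence->second sequence).
my %transitions;
for my $pair (qw(
    TTC->TTT CTG->UUA AUU->AUA GUG->GUA UCC->UCG CCC->CCG ACC->ACG GCC->GCG
    UAC->UAU UGA->UAG CAC->CAU CAG->CAA AAC->AAU AAG->AAA GAC->GAU GAG->GAA
    UGC->UGU CGG->CGU AGC->AGU AGG->CGU AGA->CGU UAA->UAG GGC->GGU
)) {
    my ($from, $to) = split /->/, $pair;
    $transitions{$from}{$to} = 1;
}

# Hypothetical helper: compare two aligned sequences codon by codon and
# return one report line per allowed transition that occurs.
sub report_changes {
    my ($seq1, $seq2) = @_;
    my (%locations, @order);
    my $len = length($seq1) < length($seq2) ? length($seq1) : length($seq2);
    for my $i (0 .. int($len / 3) - 1) {
        my $codon1 = substr($seq1, $i * 3, 3);
        my $codon2 = substr($seq2, $i * 3, 3);
        # Check the first level before the second so nothing is autovivified.
        next unless exists $transitions{$codon1}
                and exists $transitions{$codon1}{$codon2};
        my $key = join '->', $codon1, $codon2;
        push @order, $key unless $locations{$key};    # remember first-seen order
        push @{ $locations{$key} }, $i + 1;           # 1-based codon location
    }
    return map {
        sprintf '%s considered/location-%s/count-%d',
            $_, join(',', @{ $locations{$_} }), scalar @{ $locations{$_} }
    } @order;
}

print "$_\n" for report_changes('TTCAUUUCCCAU', 'TTTAUAUCGCAC');
```

For the sequences in the question this prints the three expected lines and skips CAU->CAC. For files split over several lines, the same sub could be called on each pair of lines, with the location offset carried over.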
NOTE: use exists() instead of defined(), as it saves some extra computation. If you don't want nested if() statements, you can compute $codon1 and $codon2 and then check if (exists($transitions{$codon1}{$codon2})) {} in one step. Be aware that the one-step form autovivifies $transitions{$codon1} when that key is missing; testing exists($transitions{$codon1}) first avoids the autovivification problem.
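A small demonstration of the autovivification behaviour, using a one-entry %transitions hash as a stand-in. The codons GGG and GGA are arbitrary examples, not taken from the question's list.

```perl
use strict;
use warnings;

my %transitions = (TTC => { TTT => 1 });

# One-step check: looking through a missing first-level key creates that key.
my $one_step = exists($transitions{CAU}{CAC});
print "one-step check created an empty CAU entry\n" if exists $transitions{CAU};

# Two-step check: the second test only runs when the first level is present.
my $two_step = exists($transitions{GGG}) && exists($transitions{GGG}{GGA});
print "two-step check left the hash alone\n" unless exists $transitions{GGG};
```

Both checks return false here, so the result of the lookup is the same either way. The difference is only that the one-step form leaves empty entries behind in the hash.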