生成具有取代率的合成 DNA 序列

发布于 2024-07-14 13:02:23 字数 908 浏览 6 评论 0原文

给定这些输入:

my $init_seq = "AAAAAAAAAA" #length 10 bp 
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );

我想生成:

  1. 一千个长度为 10 的标签

  2. 标签中每个位置的替换率为 0.003

产生如下输出:

AAAAAAAAAA
AATAACAAAA
.....
AAGGAAAAGA # 1000th tags

是否有一种紧凑的方法可以在 Perl 中做到这一点?

我坚持这个脚本的核心逻辑:

#!/usr/bin/perl

my $init_seq = "AAAAAAAAAA" #length 10 bp 
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );

    $i = 0;
    while ($i < length($init_seq)) {
        $roll = int(rand 4) + 1;       # $roll is now an integer between 1 and 4

        if ($roll == 1) {$base = A;}
        elsif ($roll == 2) {$base = T;}
        elsif ($roll == 3) {$base = C;}
        elsif ($roll == 4) {$base = G;};

        print $base;
    }
    continue {
        $i++;
    }

Given these inputs:

my $init_seq = "AAAAAAAAAA" #length 10 bp 
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );

I want to generate:

  1. One thousand length-10 tags

  2. Substitution rate for each position in a tag is 0.003

Yielding output like:

AAAAAAAAAA
AATAACAAAA
.....
AAGGAAAAGA # 1000th tags

Is there a compact way to do it in Perl?

I am stuck with the logic of this script as core:

#!/usr/bin/perl

my $init_seq = "AAAAAAAAAA" #length 10 bp 
my $sub_rate = 0.003;
my $nof_tags = 1000;
my @dna = qw( A C G T );

    $i = 0;
    while ($i < length($init_seq)) {
        $roll = int(rand 4) + 1;       # $roll is now an integer between 1 and 4

        if ($roll == 1) {$base = A;}
        elsif ($roll == 2) {$base = T;}
        elsif ($roll == 3) {$base = C;}
        elsif ($roll == 4) {$base = G;};

        print $base;
    }
    continue {
        $i++;
    }

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

遗弃M 2024-07-21 13:02:23

作为一个小的优化,将:替换

    $roll = int(rand 4) + 1;       # $roll is now an integer between 1 and 4

    if ($roll == 1) {$base = A;}
    elsif ($roll == 2) {$base = T;}
    elsif ($roll == 3) {$base = C;}
    elsif ($roll == 4) {$base = G;};

    $base = $dna[int(rand 4)];

As a small optimisation, replace:

    $roll = int(rand 4) + 1;       # $roll is now an integer between 1 and 4

    if ($roll == 1) {$base = A;}
    elsif ($roll == 2) {$base = T;}
    elsif ($roll == 3) {$base = C;}
    elsif ($roll == 4) {$base = G;};

with

    $base = $dna[int(rand 4)];
我要还你自由 2024-07-21 13:02:23

编辑:假设替换率在 0.001 到 1.000 范围内:

以及 $roll ,生成 [1..1000] 范围内的另一个(伪)随机数,如果它小于或等于到 (1000 * $sub_rate) 然后执行替换,否则不执行任何操作(即输出“A”)。

请注意,除非随机数生成器的属性已知,否则您可能会引入微妙的偏差。

EDIT: Assuming substitution rate is in the range 0.001 to 1.000:

As well as $roll, generate another (pseudo)random number in the range [1..1000], if it is less than or equal to (1000 * $sub_rate) then perform the substitution, otherwise do nothing (i.e. output 'A').

Be aware that you may introduce subtle bias unless the properties of your random number generator are known.

不顾 2024-07-21 13:02:23

不完全是你要找的,但我建议你看看 BioPerl 的 Bio::SeqEvolution::DNAPoint 模块。 但它不将突变率作为参数。 相反,它询问与您想要的原始序列同一性的下限是多少。

use strict;
use warnings;
use Bio::Seq;
use Bio::SeqEvolution::Factory;

my $seq = Bio::Seq->new(-seq => 'AAAAAAAAAA', -alphabet => 'dna');

my $evolve = Bio::SeqEvolution::Factory->new (
   -rate     => 2,      # transition/transversion rate
   -seq      => $seq
   -identity => 50      # At least 50% identity with the original
);


my @mutated;
for (1..1000) { push @mutated, $evolve->next_seq }

所有 1000 个突变序列将存储在 @mutated 数组中,它们的序列可以通过 seq 方法访问。

Not exactly what you are looking for, but I suggest you take a look at BioPerl's Bio::SeqEvolution::DNAPoint module. It does not take mutation rate as a parameter though. Rather, it asks what the lower bound of sequence identity with the original you want.

use strict;
use warnings;
use Bio::Seq;
use Bio::SeqEvolution::Factory;

my $seq = Bio::Seq->new(-seq => 'AAAAAAAAAA', -alphabet => 'dna');

my $evolve = Bio::SeqEvolution::Factory->new (
   -rate     => 2,      # transition/transversion rate
   -seq      => $seq
   -identity => 50      # At least 50% identity with the original
);


my @mutated;
for (1..1000) { push @mutated, $evolve->next_seq }

All 1000 mutated sequences will be stored in the @mutated array, their sequences can be accessed via the seq method.

戏剧牡丹亭 2024-07-21 13:02:23

如果发生替换,您希望从可能性中排除当前碱基

my @other_bases = grep { $_ ne substr($init_seq, $i, 1) } @dna;
$base = @other_bases[int(rand 3)];

另请参阅Mitch Wheat 对于如何实现替代率的回答

In the event of a substitution, you want to exclude the current base from the possibilities:

my @other_bases = grep { $_ ne substr($init_seq, $i, 1) } @dna;
$base = @other_bases[int(rand 3)];

Also please see Mitch Wheat's answer for how to implement the substitution rate.

唔猫 2024-07-21 13:02:23

我不知道我是否理解正确,但我会做这样的事情(伪代码):

digits = 'ATCG'
base = 'AAAAAAAAAA'
MAX = 1000
for i = 1 to len(base)
  # check if we have to mutate
  mutate = 1+rand(MAX) <= rate*MAX
  if mutate then
    # find current A:0 T:1 C:2 G:3
    current = digits.find(base[i])
    # get a new position 
    # but ensure that it is not current
    new = (j+1+rand(3)) mod 4        
    base[i] = digits[new]
  end if
end for

I don't know if I understand correctly but I'd do something like this (pseudocode):

digits = 'ATCG'
base = 'AAAAAAAAAA'
MAX = 1000
for i = 1 to len(base)
  # check if we have to mutate
  mutate = 1+rand(MAX) <= rate*MAX
  if mutate then
    # find current A:0 T:1 C:2 G:3
    current = digits.find(base[i])
    # get a new position 
    # but ensure that it is not current
    new = (j+1+rand(3)) mod 4        
    base[i] = digits[new]
  end if
end for
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文