Perl 程序使用引用、哈希表和子集来模拟限制性内切酶

发布于 2024-10-04 23:48:40 字数 1298 浏览 0 评论 0原文

我是 Perl 入门课程的学生。我正在寻找有关如何完成任务的建议。我的教授鼓励论坛。任务是:

编写一个 Perl 程序,该程序将从命令行获取两个文件:一个酶文件和一个 DNA 文件。使用限制性内切酶读取文件并将限制性内切酶应用于 DNA 文件。

输出将是按照 DNA 文件中出现的顺序排列的 DNA 片段。输出文件的名称应通过将限制性酶的名称附加到 DNA 文件的名称来构建,并在它们之间使用下划线。

例如,如果酶是 EcoRI 并且 DNA 文件名为 BC161026,则输出文件应命名为 BC161026_EcoRI。

我的方法是创建一个主程序和两个子程序,如下所示:

主要: 不确定如何将我的潜艇绑在一起?

子程序$DNA: 获取 DNA 文件并删除所有新行以生成单个字符串

子程序酶: 读取并存储来自命令行的酶文件中的行 以将酶首字母缩略词与剪切位置分开的方式解析文件。 将剪切位置作为正则表达式存储在哈希表中 将首字母缩略词的名称存储在哈希表中

酶文件格式注意事项: 酶文件遵循称为 Staden 的格式。示例:

AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//

酶首字母缩略词由第一个斜杠之前的字符组成(AatI,在第一个示例中。识别序列是第一个和第二个斜杠之间的所有内容(AGG'CCT,同样,在第一个示例中)。切点是在识别序列中用撇号表示 酶内 DNA 的标准缩写如下:

R = G 或 A B = 非 A(C 或 G 或 T) 等等...

除了对主要块的建议之外,您是否看到我遗漏的任何缺失部分?您能否推荐您认为对修补该程序有用的特定工具?

输入酶示例:TryII/RRR'TTT//

要读取的示例字符串:CCCCCCGGGTTTCCCCCCCCCCCCAAATTTCCCCCCCCCCCCAGATTTCCCCCCCCCCGAGTTTCCCCC

输出应为:

CCCCCCGGGG

TTTCCCCCCCCCCCCAAA

TTTCCCCCCCCCCCCAGA

TTTCCCCCCCCCCGAG

TTCCCCCC

I'm a student in an intro Perl class. I'm looking for suggestions on how to approach an assignment. My professor encourages forums. The assignment is:

Write a Perl program that will take two files from the command line, an enzyme file and a DNA file. Read the file with restriction enzymes and apply the restriction enzymes to the DNA file.

The output will be fragments of DNA arranged in the order they occur in the dna file. The name of the output files should be constructed by appending the name of the restriction enzyme to the name of the DNA file, with an underscore between them.

For example, if the enzyme is EcoRI and the DNA file is named BC161026, the output file should be named BC161026_EcoRI.

My approach is to create a main program and two subs as follows:

Main:
Not sure how to tie my subs together?

Sub program $DNA:
Take a DNA file and remove any new lines to make a single string

Sub program Enzymes:
Read and store the lines from the enzyme file which is from the command line
Parse the file in a way that it separates the enzyme acronym from the position of the cut.
Store the position of the cut as a regular expression in a hash table
Store the name of the acronym in a hash table

Note on enzyme file format:
The enzyme file follows a format known as Staden. Examples:

AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//

The enzyme acronym consists of the characters before the first slash (AatI, in the first example. The recognition sequence is everything between the first and second slash (AGG'CCT, again, in the first example). The cut point is denoted by an apostrophe in the recognition sequence
There are standard abbreviations for dna within enzymes as follows:

R = G or A
B = not A (C or G or T)
etc...

Along with a recommendation for a main chunk, do you see any missing pieces that I've omitted? Can you recommend specific tools that you think would be useful in patching this program together?

Example input enzyme: TryII/RRR'TTT//

Example string to read: CCCCCCGGGTTTCCCCCCCCCCCCAAATTTCCCCCCCCCCCCAGATTTCCCCCCCCCCGAGTTTCCCCC

The output should be:

CCCCCCGGG

TTTCCCCCCCCCCCCAAA

TTTCCCCCCCCCCCCAGA

TTTCCCCCCCCCCGAG

TTTCCCCC

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

与风相奔跑 2024-10-11 23:48:40

好吧,我知道我不应该只做你的作业,但是这个有一些有趣的技巧,所以我玩了一下。从中学习,而不是仅仅复制。我评论得不太好,有不明白的地方请追问。这其中有一些微妙的魔力,如果你在课堂上没有讲过,你的教授会知道,所以一定要理解。

#!/usr/bin/env perl

use strict;
use warnings;

use Getopt::Long;

my ($enzyme_file, $dna_file);
my $write_output = 0;
my $verbose = 0;
my $help = 0;
GetOptions(
  'enzyme=s' => \$enzyme_file,
  'dna=s' => \$dna_file,
  'output' => \$write_output,
  'verbose' => \$verbose,
  'help' => \$help
);

$help = 1 unless ($dna_file && $enzyme_file);
help() if $help; # exits

# 'Main'
my $dna = getDNA($dna_file);
my %enzymes = %{ getEnzymes($enzyme_file) }; # A function cannot return a hash, so return a hashref and then store the referenced hash
foreach my $enzyme (keys %enzymes) {
  print "Applying enzyme " . $enzyme . " gives:\n";
  my $dna_holder = $dna;
  my ($precut, $postcut) = ($enzymes{$enzyme}{'precut'}, $enzymes{$enzyme}{'postcut'});

  my $R = qr/[GA]/;
  my $B = qr/[CGT]/;

  $precut =~ s/R/${R}/g;
  $precut =~ s/B/${B}/g;
  $postcut =~ s/R/${R}/g;
  $postcut =~ s/B/${B}/g;
  print "\tPre-Cut pattern: " . $precut . "\n" if $verbose;
  print "\tPost-Cut pattern: " . $postcut . "\n" if $verbose;

  #while(1){
  #  if ($dna_holder =~ s/(.*${precut})(${postcut}.*)/$1/ ) {
  #    print "\tFound section:" . $2 . "\n" if $verbose;
  #    print "\tRemaining DNA: " . $1 . "\n" if $verbose;
  #    unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $2;
  #  } else {
  #    unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $dna_holder;
  #    print "\tNo more cuts.\n" if $verbose;
  #    print "\t" . $_ . "\n" for @{ $enzymes{$enzyme}{'cut_dna'} };
  #    last;
  #  }
  #}
  unless ($dna_holder =~ s/(${precut})(${postcut})/$1'$2/g) {
    print "\tHas no effect on given strand\n" if $verbose;
  }
  @{ $enzymes{$enzyme}{'cut_dna'} } = split(/'/, $dna_holder);
  print "\t$_\n" for @{ $enzymes{$enzyme}{'cut_dna'} };

  writeOutput($dna_file, $enzyme, $enzymes{$enzyme}{'cut_dna'}) if $write_output; #Note that $enzymes{$enzyme}{'cut_dna'} is an arrayref already
  print "\n";
}

sub getDNA {
  my ($dna_file) = @_;

  open(my $dna_handle, '<', $dna_file) or die "Cannot open file $dna_file";
  my @dna_array = <$dna_handle>;
  chomp(@dna_array);

  my $dna = join('', @dna_array);

  print "Using DNA:\n" . $dna . "\n\n" if $verbose;
  return $dna;
}

sub getEnzymes {
  my ($enzyme_file) = @_;
  my %enzymes;

  open(my $enzyme_handle, '<', $enzyme_file) or die "Cannot open file $enzyme_file";
  while(<$enzyme_handle>) {
    chomp;
    if(m{([^/]*)/([^']*)'([^/]*)//}) {
      print "Found Enzyme " . $1 . ":\n\tPre-cut: " . $2 . "\n\tPost-cut: " . $3 . "\n" if $verbose;
      $enzymes{$1} = {
        precut => $2,
        postcut => $3,
        cut_dna => [] #Added to show the empty array that will hold the cut DNA sections
      };
    }
  }

  print "\n" if $verbose;
  return \%enzymes;
}

sub writeOutput {

  my ($dna_file, $enzyme, $cut_dna_ref) = @_;

  my $outfile = $dna_file . '_' . $enzyme;
  print "\tSaving data to $outfile\n" if $verbose; 
  open(my $outfile_handle, '>', $outfile) or die "Cannot open $outfile for writing";

  print $outfile_handle $_ . "\n" for @{ $cut_dna_ref };
}

sub help {

  my $filename = (split('/', $0))[-1];

  my $enzyme_text = <<'END';
AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//
TryII/RRR'TTT//
Test/AAA'TTT//
END

  my $dna_text = <<'END';
CCCCCCGGGTTTCCCCCCC
CCCCCAAATTTCCCCCCCCCCCCAGATTTC
CCCCCCCCCGAGTTTCCCCC
END

  print <<END;
Usage: 
    $filename --enzyme (-e) <enzyme-filename> --dna (-d) <dna-filename> [options] (files may come in either order)
    $filename -h    (shows this help)

Options: 
    --verbose (-v)  print additional (debugging) information
    --output (-o)   output final data to files


Files:
The DNA file contains one DNA string which may be broken over many lines. E.G.:

$dna_text

The enzymes file constains enzyme definitions, one per line. E.G.:

$enzyme_text
END

exit;
}

编辑:明确添加了 cut_dna 初始化,因为这是每种酶的最终结果持有者,所以我认为更清楚地看到它会很好。

编辑2:添加了输出例程、调用、标志和相应的帮助。

编辑 3:更改了主例程,以结合 Canavanin 的最佳方法,同时删除循环。现在它是一个全局替换,添加临时剪切标记('),然后将剪切标记拆分到数组中。留下旧方法作为注释,新方法是下面的 5 行。

编辑 4:用于写入多个文件的附加测试用例。 (以下)

my @names = ('cat','dog','sheep'); 
foreach my $name (@names) { #$name is lexical, ie dies after each loop
  open(my $handle, '>', $name); #open a lexical handle for the file, also dies each loop
  print $handle $name; #write to the handle
  #$handles closes automatically when it "goes out of scope"
}

Ok, I know I shouldn't just do your homework, but there were some fun tricks to this one, so I played with it. Learn from this, not just copy. I didn't comment very well, so if there is something you don't understand, please ask. There is some slight magic in this that if you haven't covered it in your class, your prof will know, so be sure you understand.

#!/usr/bin/env perl

use strict;
use warnings;

use Getopt::Long;

my ($enzyme_file, $dna_file);
my $write_output = 0;
my $verbose = 0;
my $help = 0;
GetOptions(
  'enzyme=s' => \$enzyme_file,
  'dna=s' => \$dna_file,
  'output' => \$write_output,
  'verbose' => \$verbose,
  'help' => \$help
);

$help = 1 unless ($dna_file && $enzyme_file);
help() if $help; # exits

# 'Main'
my $dna = getDNA($dna_file);
my %enzymes = %{ getEnzymes($enzyme_file) }; # A function cannot return a hash, so return a hashref and then store the referenced hash
foreach my $enzyme (keys %enzymes) {
  print "Applying enzyme " . $enzyme . " gives:\n";
  my $dna_holder = $dna;
  my ($precut, $postcut) = ($enzymes{$enzyme}{'precut'}, $enzymes{$enzyme}{'postcut'});

  my $R = qr/[GA]/;
  my $B = qr/[CGT]/;

  $precut =~ s/R/${R}/g;
  $precut =~ s/B/${B}/g;
  $postcut =~ s/R/${R}/g;
  $postcut =~ s/B/${B}/g;
  print "\tPre-Cut pattern: " . $precut . "\n" if $verbose;
  print "\tPost-Cut pattern: " . $postcut . "\n" if $verbose;

  #while(1){
  #  if ($dna_holder =~ s/(.*${precut})(${postcut}.*)/$1/ ) {
  #    print "\tFound section:" . $2 . "\n" if $verbose;
  #    print "\tRemaining DNA: " . $1 . "\n" if $verbose;
  #    unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $2;
  #  } else {
  #    unshift @{ $enzymes{$enzyme}{'cut_dna'} }, $dna_holder;
  #    print "\tNo more cuts.\n" if $verbose;
  #    print "\t" . $_ . "\n" for @{ $enzymes{$enzyme}{'cut_dna'} };
  #    last;
  #  }
  #}
  unless ($dna_holder =~ s/(${precut})(${postcut})/$1'$2/g) {
    print "\tHas no effect on given strand\n" if $verbose;
  }
  @{ $enzymes{$enzyme}{'cut_dna'} } = split(/'/, $dna_holder);
  print "\t$_\n" for @{ $enzymes{$enzyme}{'cut_dna'} };

  writeOutput($dna_file, $enzyme, $enzymes{$enzyme}{'cut_dna'}) if $write_output; #Note that $enzymes{$enzyme}{'cut_dna'} is an arrayref already
  print "\n";
}

sub getDNA {
  my ($dna_file) = @_;

  open(my $dna_handle, '<', $dna_file) or die "Cannot open file $dna_file";
  my @dna_array = <$dna_handle>;
  chomp(@dna_array);

  my $dna = join('', @dna_array);

  print "Using DNA:\n" . $dna . "\n\n" if $verbose;
  return $dna;
}

sub getEnzymes {
  my ($enzyme_file) = @_;
  my %enzymes;

  open(my $enzyme_handle, '<', $enzyme_file) or die "Cannot open file $enzyme_file";
  while(<$enzyme_handle>) {
    chomp;
    if(m{([^/]*)/([^']*)'([^/]*)//}) {
      print "Found Enzyme " . $1 . ":\n\tPre-cut: " . $2 . "\n\tPost-cut: " . $3 . "\n" if $verbose;
      $enzymes{$1} = {
        precut => $2,
        postcut => $3,
        cut_dna => [] #Added to show the empty array that will hold the cut DNA sections
      };
    }
  }

  print "\n" if $verbose;
  return \%enzymes;
}

sub writeOutput {

  my ($dna_file, $enzyme, $cut_dna_ref) = @_;

  my $outfile = $dna_file . '_' . $enzyme;
  print "\tSaving data to $outfile\n" if $verbose; 
  open(my $outfile_handle, '>', $outfile) or die "Cannot open $outfile for writing";

  print $outfile_handle $_ . "\n" for @{ $cut_dna_ref };
}

sub help {

  my $filename = (split('/', $0))[-1];

  my $enzyme_text = <<'END';
AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//
TryII/RRR'TTT//
Test/AAA'TTT//
END

  my $dna_text = <<'END';
CCCCCCGGGTTTCCCCCCC
CCCCCAAATTTCCCCCCCCCCCCAGATTTC
CCCCCCCCCGAGTTTCCCCC
END

  print <<END;
Usage: 
    $filename --enzyme (-e) <enzyme-filename> --dna (-d) <dna-filename> [options] (files may come in either order)
    $filename -h    (shows this help)

Options: 
    --verbose (-v)  print additional (debugging) information
    --output (-o)   output final data to files


Files:
The DNA file contains one DNA string which may be broken over many lines. E.G.:

$dna_text

The enzymes file constains enzyme definitions, one per line. E.G.:

$enzyme_text
END

exit;
}

Edit: Added cut_dna initialization explicitly because this is the final result holder for each enzyme, so I thought it would be good to see it more clearly.

Edit 2: Added output routine, call, flag and corresponding help.

Edit 3: Changed main routine to incorporate the best of canavanin's method while removing loops. Now its a global replace to add temporary cut mark (') and then split on cut mark into array. Left old method as comment, new method is the 5 lines following.

Edit 4: Additional test case for writing to multiple files. (Below)

my @names = ('cat','dog','sheep'); 
foreach my $name (@names) { #$name is lexical, ie dies after each loop
  open(my $handle, '>', $name); #open a lexical handle for the file, also dies each loop
  print $handle $name; #write to the handle
  #$handles closes automatically when it "goes out of scope"
}
折戟 2024-10-11 23:48:40

请注意,在酶中,当您将酶存储在哈希中时,酶的名称应该是键,位点应该是值(因为原则上两种酶可以具有相同的位点)。

在 Main 例程中,您可以迭代哈希;对于每种酶产生一个输出文件。最直接的方法是将位点翻译为正则表达式(通过其他正则表达式)并将其应用于 DNA 序列,但还有其他方法。 (可能值得将其分成至少一个其他子部分。)

Note that in Enzymes, when you store an enzyme in the hash the name of the enzyme should be the key and the site should be the value (since in principle two enzymes could have identical sites).

In the Main routine, you can iterate through the hash; for each enzyme produce one output file. The most direct way is to translate the site to a regex (by means of other regexs) and apply it to the DNA sequence, but there are other ways. (It is probably worth splitting this off into at least one other sub.)

半暖夏伤 2024-10-11 23:48:40

这是我尝试解决问题的方法(代码如下)。

1) 从参数中获取文件名并创建相应的文件句柄
2) 为指定格式的输出文件创建一个新的文件句柄
3)从第一个文件中提取“切点”
4) 第二个文件中的 DNA 序列在步骤3中获得的切割点上循环。

#!/usr/bin/perl
use strict;
use warnings;
my $file_enzyme=$ARGV[0];
my $file_dna=$ARGV[1];

open DNASEQ, ">$file_dna"."_"."$file_enzyme";
open ENZYME, "$file_enzyme";
open DNA, "$file_dna";
while (<ENZYME>) {
 chomp;
  if( /'(.*)\/\//) { # Extracts the cut point between ' & // in the enzyme file
    my $pattern=$1;
    while (<DNA>) {
     chomp;
     #print $pattern;
     my @output=split/$pattern/,;
     print DNASEQ shift @output,"\n"; #first recognized sequence being output
     foreach (@output) {
        print DNASEQ "$pattern$_\n"; #prefixing the remaining sequences with the cut point pattern
     }
   }
 }
}
close DNA;
close ENZYME;
close DNASEQ;

Here is how I have gone about trying to solve the problem (code below).

1) The file names are picked up from the arguments and respective filehandles are created.
2) A new file handle is created for the output file which in the specified format
3) The "cut points" are extracted from the first file
4) The DNA Sequences in the second file are looped over the cut points obtained in step 3.

#!/usr/bin/perl
use strict;
use warnings;
my $file_enzyme=$ARGV[0];
my $file_dna=$ARGV[1];

open DNASEQ, ">$file_dna"."_"."$file_enzyme";
open ENZYME, "$file_enzyme";
open DNA, "$file_dna";
while (<ENZYME>) {
 chomp;
  if( /'(.*)\/\//) { # Extracts the cut point between ' & // in the enzyme file
    my $pattern=$1;
    while (<DNA>) {
     chomp;
     #print $pattern;
     my @output=split/$pattern/,;
     print DNASEQ shift @output,"\n"; #first recognized sequence being output
     foreach (@output) {
        print DNASEQ "$pattern$_\n"; #prefixing the remaining sequences with the cut point pattern
     }
   }
 }
}
close DNA;
close ENZYME;
close DNASEQ;
沫雨熙 2024-10-11 23:48:40

我知道已经有几个答案了,但是嘿...我只是想试试运气,所以这是我的建议:

#!/usr/bin/perl

use warnings;
use strict;
use Getopt::Long;

my ($enz_file, $dna_file);

GetOptions( "e=s" => \$enz_file,
            "d=s" => \$dna_file,
          );

if (! $enz_file || ! $dna_file) {
   # some help text 
   print STDERR<<EOF; 

   Usage: restriction.pl -e enzyme_file -d DNA_file

   The enzyme_file should contain one enzyme entry per line.
   The DNA_file may contain the sequence on one single or on
   several lines; all lines will be concatenated to yield a
   single string.
EOF      
   exit();
}

my %enz_and_patterns; # stores enzyme name and corresponding pattern

open ENZ, "<$enz_file" or die "Could not open file $enz_file: $!";
while (<ENZ>) {
   if (m#^(\w+)/([\w']+)//$#) {
      my $enzyme  = $1; # could also remove those two lines and use 
      my $pattern = $2; # the match variables directly, but this is clearer

      $enz_and_patterns{$enzyme} = $pattern;
   }
}
close ENZ;

my $dna_sequence;

open DNA, "<$dna_file" or die "Could not open file $dna_file: $!";
while (my $line = <DNA>) {
   chomp $line;
   $dna_sequence .= $line; # append the current bit to the sequence
                           # that has been read so far
}
close DNA;

foreach my $enzyme (keys %enz_and_patterns) {
   my $dna_seq_processed = $dna_sequence; # local copy so that we retain the original

   # now translate the restriction pattern to a regular expression pattern:
   my $pattern = $enz_and_patterns{$enzyme};
   $pattern    =~ s/R/[GA]/g; # use character classes
   $pattern    =~ s/B/[^A]/g;
   $pattern    =~ s/(.+)'(.+)/($1)($2)/; # remove the ', but due to the grouping
                                         # we "remember" its position

   $dna_seq_processed =~ s/$pattern/$1\n$2/g; # in effect we are simply replacing
                                              # each ' with a newline character
   my $outfile = "${dna_file}_$enzyme";
   open OUT, ">$outfile" or die "Could not open file $outfile: $!";
   print OUT $dna_seq_processed , "\n";
   close OUT;
}

我已经用你的 TryII 示例测试了我的代码,效果很好。

由于这是一项只需编写几行非重复代码即可处理的任务,因此我认为创建单独的子例程是不合理的。我希望我能被原谅...:)

I know there have been several answers already, but hey... I just felt like trying my luck, so here's my suggestion:

#!/usr/bin/perl

use warnings;
use strict;
use Getopt::Long;

my ($enz_file, $dna_file);

GetOptions( "e=s" => \$enz_file,
            "d=s" => \$dna_file,
          );

if (! $enz_file || ! $dna_file) {
   # some help text 
   print STDERR<<EOF; 

   Usage: restriction.pl -e enzyme_file -d DNA_file

   The enzyme_file should contain one enzyme entry per line.
   The DNA_file may contain the sequence on one single or on
   several lines; all lines will be concatenated to yield a
   single string.
EOF      
   exit();
}

my %enz_and_patterns; # stores enzyme name and corresponding pattern

open ENZ, "<$enz_file" or die "Could not open file $enz_file: $!";
while (<ENZ>) {
   if (m#^(\w+)/([\w']+)//$#) {
      my $enzyme  = $1; # could also remove those two lines and use 
      my $pattern = $2; # the match variables directly, but this is clearer

      $enz_and_patterns{$enzyme} = $pattern;
   }
}
close ENZ;

my $dna_sequence;

open DNA, "<$dna_file" or die "Could not open file $dna_file: $!";
while (my $line = <DNA>) {
   chomp $line;
   $dna_sequence .= $line; # append the current bit to the sequence
                           # that has been read so far
}
close DNA;

foreach my $enzyme (keys %enz_and_patterns) {
   my $dna_seq_processed = $dna_sequence; # local copy so that we retain the original

   # now translate the restriction pattern to a regular expression pattern:
   my $pattern = $enz_and_patterns{$enzyme};
   $pattern    =~ s/R/[GA]/g; # use character classes
   $pattern    =~ s/B/[^A]/g;
   $pattern    =~ s/(.+)'(.+)/($1)($2)/; # remove the ', but due to the grouping
                                         # we "remember" its position

   $dna_seq_processed =~ s/$pattern/$1\n$2/g; # in effect we are simply replacing
                                              # each ' with a newline character
   my $outfile = "${dna_file}_$enzyme";
   open OUT, ">$outfile" or die "Could not open file $outfile: $!";
   print OUT $dna_seq_processed , "\n";
   close OUT;
}

I've tested my code with your TryII example, which worked fine.

As this is a task which can be handled by writing just a few lines of non-repetitive code I didn't feel creating separate subroutines would have been justified. I hope I will be forgiven... :)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文