Perl 程序使用引用、哈希表和子集来模拟限制性内切酶
我是 Perl 入门课程的学生。我正在寻找有关如何完成任务的建议。我的教授鼓励论坛。任务是:
编写一个 Perl 程序,该程序将从命令行获取两个文件:一个酶文件和一个 DNA 文件。使用限制性内切酶读取文件并将限制性内切酶应用于 DNA 文件。
输出将是按照 DNA 文件中出现的顺序排列的 DNA 片段。输出文件的名称应通过将限制性酶的名称附加到 DNA 文件的名称来构建,并在它们之间使用下划线。
例如,如果酶是 EcoRI 并且 DNA 文件名为 BC161026,则输出文件应命名为 BC161026_EcoRI。
我的方法是创建一个主程序和两个子程序,如下所示:
主要: 不确定如何将我的潜艇绑在一起?
子程序$DNA: 获取 DNA 文件并删除所有新行以生成单个字符串
子程序酶: 读取并存储来自命令行的酶文件中的行 以将酶首字母缩略词与剪切位置分开的方式解析文件。 将剪切位置作为正则表达式存储在哈希表中 将首字母缩略词的名称存储在哈希表中
酶文件格式注意事项: 酶文件遵循称为 Staden 的格式。示例:
AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//
酶首字母缩略词由第一个斜杠之前的字符组成(AatI,在第一个示例中。识别序列是第一个和第二个斜杠之间的所有内容(AGG'CCT,同样,在第一个示例中)。切点是在识别序列中用撇号表示 酶内 DNA 的标准缩写如下:
R = G 或 A B = 非 A(C 或 G 或 T) 等等...
除了对主要块的建议之外,您是否看到我遗漏的任何缺失部分?您能否推荐您认为对修补该程序有用的特定工具?
输入酶示例:TryII/RRR'TTT//
要读取的示例字符串:CCCCCCGGGTTTCCCCCCCCCCCCAAATTTCCCCCCCCCCCCAGATTTCCCCCCCCCCGAGTTTCCCCC
输出应为:
CCCCCCGGGG
TTTCCCCCCCCCCCCAAA
TTTCCCCCCCCCCCCAGA
TTTCCCCCCCCCCGAG
TTCCCCCC
I'm a student in an intro Perl class. I'm looking for suggestions on how to approach an assignment. My professor encourages forums. The assignment is:
Write a Perl program that will take two files from the command line, an enzyme file and a DNA file. Read the file with restriction enzymes and apply the restriction enzymes to the DNA file.
The output will be fragments of DNA arranged in the order they occur in the dna file. The name of the output files should be constructed by appending the name of the restriction enzyme to the name of the DNA file, with an underscore between them.
For example, if the enzyme is EcoRI and the DNA file is named BC161026, the output file should be named BC161026_EcoRI.
My approach is to create a main program and two subs as follows:
Main:
Not sure how to tie my subs together?
Sub program $DNA:
Take a DNA file and remove any new lines to make a single string
Sub program Enzymes:
Read and store the lines from the enzyme file which is from the command line
Parse the file in a way that it separates the enzyme acronym from the position of the cut.
Store the position of the cut as a regular expression in a hash table
Store the name of the acronym in a hash table
Note on enzyme file format:
The enzyme file follows a format known as Staden. Examples:
AatI/AGG'CCT//
AatII/GACGT'C//
AbsI/CC'TCGAGG//
The enzyme acronym consists of the characters before the first slash (AatI, in the first example. The recognition sequence is everything between the first and second slash (AGG'CCT, again, in the first example). The cut point is denoted by an apostrophe in the recognition sequence
There are standard abbreviations for dna within enzymes as follows:R = G or A
B = not A (C or G or T)
etc...
Along with a recommendation for a main chunk, do you see any missing pieces that I've omitted? Can you recommend specific tools that you think would be useful in patching this program together?
Example input enzyme: TryII/RRR'TTT//
Example string to read: CCCCCCGGGTTTCCCCCCCCCCCCAAATTTCCCCCCCCCCCCAGATTTCCCCCCCCCCGAGTTTCCCCC
The output should be:
CCCCCCGGG
TTTCCCCCCCCCCCCAAA
TTTCCCCCCCCCCCCAGA
TTTCCCCCCCCCCGAG
TTTCCCCC
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
好吧,我知道我不应该只做你的作业,但是这个有一些有趣的技巧,所以我玩了一下。从中学习,而不是仅仅复制。我评论得不太好,有不明白的地方请追问。这其中有一些微妙的魔力,如果你在课堂上没有讲过,你的教授会知道,所以一定要理解。
编辑:明确添加了 cut_dna 初始化,因为这是每种酶的最终结果持有者,所以我认为更清楚地看到它会很好。
编辑2:添加了输出例程、调用、标志和相应的帮助。
编辑 3:更改了主例程,以结合 Canavanin 的最佳方法,同时删除循环。现在它是一个全局替换,添加临时剪切标记('),然后将剪切标记拆分到数组中。留下旧方法作为注释,新方法是下面的 5 行。
编辑 4:用于写入多个文件的附加测试用例。 (以下)
Ok, I know I shouldn't just do your homework, but there were some fun tricks to this one, so I played with it. Learn from this, not just copy. I didn't comment very well, so if there is something you don't understand, please ask. There is some slight magic in this that if you haven't covered it in your class, your prof will know, so be sure you understand.
Edit: Added cut_dna initialization explicitly because this is the final result holder for each enzyme, so I thought it would be good to see it more clearly.
Edit 2: Added output routine, call, flag and corresponding help.
Edit 3: Changed main routine to incorporate the best of canavanin's method while removing loops. Now its a global replace to add temporary cut mark (') and then split on cut mark into array. Left old method as comment, new method is the 5 lines following.
Edit 4: Additional test case for writing to multiple files. (Below)
请注意,在酶中,当您将酶存储在哈希中时,酶的名称应该是键,位点应该是值(因为原则上两种酶可以具有相同的位点)。
在 Main 例程中,您可以迭代哈希;对于每种酶产生一个输出文件。最直接的方法是将位点翻译为正则表达式(通过其他正则表达式)并将其应用于 DNA 序列,但还有其他方法。 (可能值得将其分成至少一个其他子部分。)
Note that in Enzymes, when you store an enzyme in the hash the name of the enzyme should be the key and the site should be the value (since in principle two enzymes could have identical sites).
In the Main routine, you can iterate through the hash; for each enzyme produce one output file. The most direct way is to translate the site to a regex (by means of other regexs) and apply it to the DNA sequence, but there are other ways. (It is probably worth splitting this off into at least one other sub.)
这是我尝试解决问题的方法(代码如下)。
1) 从参数中获取文件名并创建相应的
文件句柄
。2) 为指定格式的输出文件创建一个新的文件句柄
3)从第一个文件中提取“切点”
4) 第二个文件中的 DNA 序列在步骤3中获得的切割点上循环。
Here is how I have gone about trying to solve the problem (code below).
1) The file names are picked up from the arguments and respective
filehandles
are created.2) A new file handle is created for the output file which in the specified format
3) The "cut points" are extracted from the first file
4) The DNA Sequences in the second file are looped over the cut points obtained in step 3.
我知道已经有几个答案了,但是嘿...我只是想试试运气,所以这是我的建议:
我已经用你的 TryII 示例测试了我的代码,效果很好。
由于这是一项只需编写几行非重复代码即可处理的任务,因此我认为创建单独的子例程是不合理的。我希望我能被原谅...:)
I know there have been several answers already, but hey... I just felt like trying my luck, so here's my suggestion:
I've tested my code with your TryII example, which worked fine.
As this is a task which can be handled by writing just a few lines of non-repetitive code I didn't feel creating separate subroutines would have been justified. I hope I will be forgiven... :)