Finding/replacing substrings with annotations in an ASCII file in Python
I'm having a little coding issue in a bioinformatics project I'm working on. Basically, my task is to extract motif sequences from a database and use the information to annotate a sequence alignment file. The alignment file is plain text, so the annotation will not be anything elaborate, at best simply replacing the extracted sequences with asterisks in the alignment file itself.
I have a script which scans the database file, extracts all sequences I need, and writes them to an output file. What I need is, given a query, to read these sequences and match them to their corresponding substrings in the ASCII alignment files. Finally, for every occurrence of a motif sequence (substring of a very large string of characters) I would replace motif sequence XXXXXXX with a sequence of asterisks *.
The code I am using goes like this (11SGLOBULIN is the name of the protein entry in the database):
motif_file = open('/users/myfolder/final motifs_11SGLOBULIN','r')
align_file = open('/Users/myfolder/alignmentfiles/11sglobulin.seqs', 'w+')
finalmotifs = motif_file.readlines()
seqalign = align_file.readlines()
for line in seqalign:
    if motif[i] in seqalign: # I have stored all motifs in a list called "motif"
        replace(motif, '*****')
But instead of replacing each string with a sequence of asterisks, it deletes the entire file. Can anyone see why this is happening?
I suspect that the problem may lie in the fact that my ASCII file is basically just one very long list of amino acids, and Python cannot know how to replace a particular substring hidden within a very long string.
Comments (4)
Something like the following should do the trick. I've made assumptions about your input data, as you haven't posted samples, and I'm assuming you're running Python 2.7.
You are misunderstanding the w+ file mode. Using mode w+ with open() will truncate the file (that is, delete everything in it); see: http://docs.python.org/library/functions.html#open. Your sequence data are gone as soon as you call:
align_file = open('/Users/myfolder/alignmentfiles/11sglobulin.seqs', 'w+')
Also, replace() only operates on the strings read from the file, and it returns a new string rather than changing anything on disk. You need to explicitly write the altered strings back out. Your best bet is to use a third file to store your results. If you really want to, you can copy the resulting file over the original align_file when you are done.
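To see both points in isolation, here is a small self-contained sketch. The scratch file demo.seqs, the sequence MKTAYIAKQR and the motif AYIA are made up for illustration; the only things being demonstrated are how open() treats the 'r' and 'w+' modes and what str.replace() returns.

```python
import os
import tempfile

# Hypothetical scratch file standing in for the alignment file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'demo.seqs')

with open(path, 'w') as handle:
    handle.write('MKTAYIAKQR\n')

# Mode 'r' leaves the contents untouched.
with open(path, 'r') as handle:
    before = handle.read()

# Mode 'w+' truncates the file as soon as it is opened,
# so there is nothing left to read afterwards.
with open(path, 'w+') as handle:
    after = handle.read()

# str.replace returns a new string; neither 'before' nor the file changes.
masked = before.replace('AYIA', '*' * len('AYIA'))
```

After this runs, before still holds the sequence, after is an empty string, and masked only exists in memory until it is written to a file.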
You could simplify this a little more by changing the innermost while loop from:
to:
Thanks everyone, I really appreciate the responses, and sorry for the delay in answering. So basically what I should have been doing was, as many pointed out, opening the file to annotate for reading and writing the annotations to a new file. This bit of code did the trick:
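A minimal sketch of that approach (read the alignment, write the annotated copy to a new file) might look like the following. It is an illustration rather than the code originally posted: the function names mask_motifs and annotate_alignment, the one-motif-per-line layout of the motif file, and the choice to use as many asterisks as the motif has letters are all assumptions.

```python
def mask_motifs(text, motifs):
    """Return text with every occurrence of each motif replaced by asterisks."""
    # Longest motifs first, so a short motif does not break up a longer one.
    for motif in sorted(motifs, key=len, reverse=True):
        # Same number of asterisks as letters keeps the alignment columns in place.
        text = text.replace(motif, '*' * len(motif))
    return text


def annotate_alignment(motif_path, align_path, out_path):
    """Read motifs and the alignment, then write the masked alignment to out_path."""
    with open(motif_path, 'r') as motif_file:
        # Assumption: one motif per line; blank lines are skipped.
        motifs = [line.strip() for line in motif_file if line.strip()]
    # The alignment is opened read-only; results go to a separate file.
    with open(align_path, 'r') as align_file, open(out_path, 'w') as out_file:
        for line in align_file:
            out_file.write(mask_motifs(line, motifs))
```

Because the alignment is processed line by line, a motif that is split across a line break in the file would not be found; reading the whole file into one string before calling mask_motifs would be one way around that, at the cost of memory.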