Find/replace substrings with annotations in an ASCII file in Python

Posted 2024-11-05 09:29:34

I'm having a little coding issue in a bioinformatics project I'm working on. Basically, my task is to extract motif sequences from a database and use the information to annotate a sequence alignment file. The alignment file is plain text, so the annotation will not be anything elaborate, at best simply replacing the extracted sequences with asterisks in the alignment file itself.

I have a script which scans the database file, extracts all sequences I need, and writes them to an output file. What I need is, given a query, to read these sequences and match them to their corresponding substrings in the ASCII alignment files. Finally, for every occurrence of a motif sequence (substring of a very large string of characters) I would replace motif sequence XXXXXXX with a sequence of asterisks *.

The code I am using goes like this (11SGLOBULIN is the name of the protein entry in the database):

motif_file = open('/users/myfolder/final motifs_11SGLOBULIN','r')
align_file = open('/Users/myfolder/alignmentfiles/11sglobulin.seqs', 'w+') 
finalmotifs = motif_file.readlines()
seqalign = align_file.readlines() 


for line in seqalign:
    if motif[i] in seqalign:  # I have stored all motifs in a list called "motif"
        replace(motif, '*****')

But instead of replacing each string with a sequence of asterisks, it deletes the entire file. Can anyone see why this is happening?

I suspect that the problem may lie in the fact that my ASCII file is basically just one very long list of amino acids, and Python cannot know how to replace a particular substring hidden within a very long string.

梦里兽 2024-11-12 09:29:34

Something like the following should do the trick. I've made assumptions about your input data, since you haven't posted samples, and I'm assuming you're running Python 2.7.

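motifs = [x.strip() for x in open('final motifs_11SGLOBULIN', 'r')]
motifs = [m for m in motifs if m]  # skip blank lines; an empty motif would loop forever
redact = '*****'

with open('11sglobulin.seqs', 'r') as data_in, open('11sglobulin.seqs.new', 'w') as data_out:
  for seq in data_in:
    for motif in motifs:
      # replace every occurrence of this motif in the current line
      while True:
        x = seq.find(motif)
        if x >= 0:
          seq = seq[:x] + redact + seq[x+len(motif):]
        else:
          break
    data_out.write(seq)  # write each processed line, not only the last one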

夜唯美灬不弃 2024-11-12 09:29:34

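You have misunderstood the w+ file mode. Opening a file with mode w+ truncates it (i.e. deletes everything in it); see http://docs.python.org/library/functions.html#open.
Your seq data is gone as soon as you call:

align_file = open('/Users/myfolder/alignmentfiles/11sglobulin.seqs', 'w+')

Also, replace only operates on the strings read from the file, not on the file itself. You need to write the changed strings back explicitly.

The best option is to use a third file to store the results. If you really want to, you can copy the resulting file over the original align_file once you are done.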
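
To make the truncation concrete, here is a minimal sketch that opens the same file once with 'r' and once with 'w+' and compares what can be read back. The temporary file and the helper name read_after_open are assumptions made for illustration; they stand in for the alignment file and are not part of the original scripts.

```python
import os
import tempfile


def read_after_open(path, mode):
    """Open path with the given mode and return whatever can be read from it."""
    with open(path, mode) as handle:
        handle.seek(0)
        return handle.read()


# Hypothetical throwaway file standing in for the alignment file.
fd, demo_path = tempfile.mkstemp(suffix=".seqs")
with os.fdopen(fd, "w") as handle:
    handle.write("MKTAYIAKQR\n")

content_r = read_after_open(demo_path, "r")        # contents are intact
content_w_plus = read_after_open(demo_path, "w+")  # file is truncated on open
size_after = os.path.getsize(demo_path)            # nothing is left on disk
os.remove(demo_path)
```

Opening with 'w+' empties the file before any read happens, which matches the behaviour described in the question: readlines() returns an empty list and the loop body never runs.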

你与昨日 2024-11-12 09:29:34

You could simplify this a little more by changing the innermost while loop from:

while True:
    x = seq.find(motif)
    if x >= 0:
      seq = seq[:x] + redact + seq[x+len(motif):]
    else:
      break

to:

if motif in seq:
  seq = seq.replace(motif, redact)
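
One caveat with a fixed redact string is that it changes the line length, which can shift the columns of an alignment. The sketch below is one possible variation, assuming length-preserving masking is wanted: each motif is replaced by one asterisk per character. The helper name mask_motifs is hypothetical. Motifs are processed longest first so that a short motif contained in a longer one does not break the longer match.

```python
def mask_motifs(seq, motifs):
    """Return seq with every occurrence of each motif replaced by asterisks.

    Each motif is replaced by '*' repeated len(motif) times, so the
    length of seq does not change.
    """
    # Longest motifs first; empty strings are skipped.
    for motif in sorted((m for m in motifs if m), key=len, reverse=True):
        seq = seq.replace(motif, "*" * len(motif))
    return seq


example = mask_motifs("AAGQLVKAAGQ", ["GQ", "GQLV"])
```

str.replace already handles every non-overlapping occurrence in the string, so no explicit find loop or membership check is needed.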

内心荒芜 2024-11-12 09:29:34

Thanks everyone, I really appreciate the responses, and sorry for the delay in answering. So basically what I should have been doing was, as many pointed out, to open the file to annotate and write those annotations to a new file. This bit of code did the trick:

# motif_file is assumed to have been opened earlier, as in the question
align_file_rmode = open('/Users/spyros/folder1/python/printsmotifs/alignfiles/query', 'r')
align_file_amode = open('/Users/spyros/folder1/python/printsmotifs/alignfiles/query', 'a+')

finalmotifs = motif_file.readlines()
seqalign = align_file_rmode.readlines()

for line in seqalign:
   for item in finalmotifs:
      item = item.strip().upper()
      if item in line:
         # one asterisk per character, so the line length is unchanged
         line = line.replace(item, '*' * len(item))
         align_file_amode.write(line)

motif_file.close()
align_file_rmode.close()
align_file_amode.close()
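
For comparison, a sketch of the same idea that writes to a separate output file and lets with blocks close the files. This is only one way to arrange it, not the code used in the thread: the function name annotate_alignment and its three path parameters are hypothetical, and it assumes one motif per line in the motif file. Unlike the loop above, it writes every line exactly once, whether or not it contains a motif.

```python
def annotate_alignment(motif_path, align_path, out_path):
    """Copy align_path to out_path, masking every motif with asterisks.

    Returns the number of motif occurrences that were masked.
    """
    with open(motif_path) as motif_file:
        # One motif per line; blank lines are ignored.
        motifs = [line.strip().upper() for line in motif_file if line.strip()]

    masked = 0
    with open(align_path) as src, open(out_path, "w") as dst:
        for line in src:
            for motif in motifs:
                masked += line.count(motif)
                line = line.replace(motif, "*" * len(motif))
            dst.write(line)  # each line is written once, matched or not
    return masked
```

Because the input is never opened for writing, the original alignment file stays untouched and the result can be inspected before it replaces anything.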
