How to get pairwise "sequence similarity scores" for ~1000 proteins?
I have a large number of protein sequences in FASTA format.
I want to get the pairwise sequence similarity score for each pair of proteins.
Is there an R package that can be used to get BLAST similarity scores for protein sequences?
Comments (2)
As per Chase's suggestion, Bioconductor is indeed the way to go, and in particular the Biostrings package. To install the latter I would suggest installing the core Bioconductor library as such:
This way you will cover all dependencies.
Now, to align 2 protein sequences, or any two sequences for that matter, you will need to use pairwiseAlignment() from Biostrings. Given a FASTA file protseq.fasta of 2 sequences that looks like this:
If you want to globally align these 2 sequences using, let's say, BLOSUM100 as your substitution matrix, with a penalty of 0 for opening a gap and -5 for extending one, then:
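A minimal sketch of such a global alignment, under a few assumptions: the two toy sequences and their names are hypothetical stand-ins for the contents of protseq.fasta, the packages were installed via BiocManager::install("Biostrings"), and the gap penalties are passed as positive costs, as the current pairwiseAlignment() documentation does. In recent Bioconductor releases the alignment functions are hosted in the pwalign package, so the sketch loads it when it is available.

```r
library(Biostrings)
# Newer Bioconductor releases keep pairwiseAlignment() in pwalign;
# attach it as well if it is installed.
if (requireNamespace("pwalign", quietly = TRUE)) library(pwalign)

# Hypothetical two-sequence FASTA file, written to a temp file so that
# the sketch is self-contained. Replace with your own protseq.fasta.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">protein1", "PAWHEAE",
             ">protein2", "HEAGAWGHEE"), fasta_path)

# Read the sequences as an AAStringSet (amino acid sequences)
seqs <- readAAStringSet(fasta_path)

# Global alignment with BLOSUM100, gap opening cost 0, gap extension cost 5
aln <- pairwiseAlignment(seqs[[1]], seqs[[2]],
                         substitutionMatrix = "BLOSUM100",
                         gapOpening = 0, gapExtension = 5,
                         type = "global")
print(aln)
```

Passing the matrix by name ("BLOSUM100") avoids having to load the matrix object separately.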
The result of this is (removed some of the alignment to save space):
To only extract the score for each alignment:
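As a rough illustration, assuming the same scoring settings as above and two hypothetical toy sequences aligned against one subject, the scores can be pulled out with score(), or the alignment objects can be skipped entirely with the scoreOnly argument:

```r
library(Biostrings)
# Newer Bioconductor releases keep pairwiseAlignment() in pwalign;
# attach it as well if it is installed.
if (requireNamespace("pwalign", quietly = TRUE)) library(pwalign)

# Hypothetical toy data: two patterns aligned against a single subject
patterns <- AAStringSet(c(a = "PAWHEAE", b = "HEAGAWGHEE"))
subject  <- AAString("HEAGAWGHEE")

alns <- pairwiseAlignment(patterns, subject,
                          substitutionMatrix = "BLOSUM100",
                          gapOpening = 0, gapExtension = 5)

# One numeric score per alignment
aln_scores <- score(alns)

# Same scores without building the alignment objects (cheaper for many pairs)
scores_only <- pairwiseAlignment(patterns, subject,
                                 substitutionMatrix = "BLOSUM100",
                                 gapOpening = 0, gapExtension = 5,
                                 scoreOnly = TRUE)
```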
Given this, you can now easily do all pairwise alignments with some very simple looping logic. To get a better hang of pairwise alignment using Bioconductor I suggest you read this.
An alternative approach would be to do a multiple sequence alignment instead of pairwise alignments. You could use bio3d and from there the seqaln() function to align all sequences in your FASTA file.
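The looping logic for all pairs could look roughly like the sketch below. The three toy sequences and their names are hypothetical (in practice they would come from readAAStringSet() on your own file), and the scoring settings are simply carried over from the example above. Each iteration aligns one sequence against all remaining ones in a single call, so only n calls are needed for n sequences.

```r
library(Biostrings)
# Newer Bioconductor releases keep pairwiseAlignment() in pwalign;
# attach it as well if it is installed.
if (requireNamespace("pwalign", quietly = TRUE)) library(pwalign)

# Hypothetical toy sequences standing in for the ~1000 real proteins
seqs <- AAStringSet(c(p1 = "PAWHEAE", p2 = "HEAGAWGHEE", p3 = "PAWHEAGHEE"))

n <- length(seqs)
scores <- matrix(NA_real_, n, n,
                 dimnames = list(names(seqs), names(seqs)))

for (i in seq_len(n)) {
  # Align sequences i..n against sequence i; keep only the scores
  s <- pairwiseAlignment(seqs[i:n], seqs[[i]],
                         substitutionMatrix = "BLOSUM100",
                         gapOpening = 0, gapExtension = 5,
                         type = "global", scoreOnly = TRUE)
  scores[i:n, i] <- s  # lower triangle (including the diagonal)
  scores[i, i:n] <- s  # mirror into the upper triangle
}

scores
```

The matrix is filled symmetrically on the assumption that a global alignment score under a symmetric substitution matrix does not depend on which sequence is the pattern and which is the subject.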
6 years later, but:
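The protr package just came out, which has a parallelized pairwise similarity scoring function for sequences, parSeqSim() (its sibling parGOsim() scores similarity from GO annotations rather than from the sequences themselves). It takes a list of protein sequences, so there is no need to write a loop.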
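A small sketch of the protr route, assuming the alignment-based function parSeqSim() is the one wanted here (parGOsim() in the same package expects GO terms or gene IDs, not sequences). The toy sequences are hypothetical, and the cores, type and submat values are illustrative choices rather than recommendations.

```r
library(protr)

# Hypothetical toy sequences as plain character strings.
# For real data, protr's readFASTA("your_file.fasta") returns such a list.
prots <- list(p1 = "PAWHEAE", p2 = "HEAGAWGHEE", p3 = "PAWHEAGHEE")

# Parallelized pairwise similarity based on sequence alignment;
# the result is an n-by-n matrix for n input sequences.
sim <- parSeqSim(prots, cores = 2, type = "local", submat = "BLOSUM62")
sim
```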