如何获得成对的“序列相似性评分”约 1000 种蛋白质?

发布于 2024-11-18 02:04:14 字数 99 浏览 3 评论 0原文

我有大量 fasta 格式的蛋白质序列。

我想获得每对蛋白质的成对序列相似性得分。

R 中的任何包都可以用于获取蛋白质序列的blast 相似性评分吗?

I have a large number of protein sequences in fasta format.

I want to get the pair-wise sequence similarity score for each pairs of the proteins.

Any package in R could be used to get the blast similarity score for protein sequences?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

梦境 2024-11-25 02:04:14

根据 Chase 的建议,bioconductor 确实是可行的方法,特别是 Biostrings 包。要安装后者,我建议安装核心 bioconductor 库:这样

source("http://bioconductor.org/biocLite.R")
biocLite()

您将涵盖所有依赖项。现在,要比对 2 个蛋白质序列或任何两个序列,您需要使用 Biostrings 中的 pairwiseAlignment。给定一个包含 2 个序列的 fasta protseq.fasta 文件,如下所示:

>protein1
MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR
MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE
RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE
GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK
IPVHPNDHVNKSQ
>protein2
MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP
LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF
MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT
TKHHEGFTNW*

如果您想使用 BLOSUM100 作为替换矩阵来全局对齐这 2 个序列,则打开空位的惩罚为 0,打开空位的惩罚为 -5然后扩展一个:

require("Biostrings")
data(BLOSUM100)
seqs <- readFASTA("~/Desktop/protseq.fasta", strip.descs=TRUE)
alm <- pairwiseAlignment(seqs[[1]]$seq, seqs[[2]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5)

结果是(删除一些对齐以节省空间):

> alm
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] MYRALRLLARSRPLVRA-PAAALAS....
subject: [1] M-R-------SRP---AGPALLLLL.... 
score: -91

仅提取每个对齐的分数:

> score(alm)
[1] -91

鉴于此,您现在可以轻松地使用一些非常简单的循环逻辑来完成所有成对对齐。为了更好地使用 bioconductor 进行成对比对,我建议您阅读 这个

另一种方法是进行多序列比对而不是成对比对。您可以使用 bio3d 并从那里 seqaln 函数用于对齐 fasta 文件中的所有序列。

As per Chase's suggestion, bioconductor is indeed the way to go and in particular the Biostrings package. To install the latter I would suggest installing the core bioconductor library as such:

source("http://bioconductor.org/biocLite.R")
biocLite()

This way you will cover all dependencies. Now, to align 2 protein sequences or any two sequences for that matter you will need to use pairwiseAlignment from Biostrings. Given a fasta protseq.fasta file of 2 sequences that looks like this:

>protein1
MYRALRLLARSRPLVRAPAAALASAPGLGGAAVPSFWPPNAAR
MASQNSFRIEYDTFGELKVPNDKYYGAQTVRSTMNFKIGGVTE
RMPTPVIKAFGILKRAAAEVNQDYGLDPKIANAIMKAADEVAE
GKLNDHFPLVVWQTGSGTQTNMNVNEVISNRAIEMLGGELGSK
IPVHPNDHVNKSQ
>protein2
MRSRPAGPALLLLLLFLGAAESVRRAQPPRRYTPDWPSLDSRP
LPAWFDEAKFGVFIHWGVFSVPAWGSEWFWWHWQGEGRPYQRF
MRDNYPPGFSYADFGPQFTARFFHPEEWADLFQAAGAKYVVLT
TKHHEGFTNW*

If you want to globally align these 2 sequences using lets say BLOSUM100 as your substitution matrix, 0 penalty for opening a gap and -5 for extending one then:

require("Biostrings")
data(BLOSUM100)
seqs <- readFASTA("~/Desktop/protseq.fasta", strip.descs=TRUE)
alm <- pairwiseAlignment(seqs[[1]]$seq, seqs[[2]]$seq, substitutionMatrix=BLOSUM100, gapOpening=0, gapExtension=-5)

The result of this is (removed some of the alignment to save space):

> alm
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] MYRALRLLARSRPLVRA-PAAALAS....
subject: [1] M-R-------SRP---AGPALLLLL.... 
score: -91

To only extract the score for each alignment:

> score(alm)
[1] -91

Given this you can easily now do all pairwise alignments with some very simple looping logic. To get a better hang of pairwise alignment using bioconductor I suggest you read this.

An alternative approach would be to do a multiple sequence alignment instead of pairwise. You could use bio3d and from there the seqaln function to align all sequences in your fasta file.

任谁 2024-11-25 02:04:14

6 年后,但是:

protr 包刚刚发布,它有一个并行的成对相似性评分函数 parGOsim()。它可以获取蛋白质序列列表,因此不需要编写循环。

6 years later, but:

The protr package just came out, which has a parallelized pairwise similarity scoring function, parGOsim(). It can take lists of protein sequences, so a loop wouldn't be necessary to write.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文