How do I perform a basic multiple sequence alignment in R?

Posted 2024-10-08 14:17:45

(I've tried asking this on BioStars, but for the slight chance that someone from text mining would think there is a better solution, I am also reposting this here)

The task I'm trying to achieve is to align several sequences.

I don't have a basic pattern to match against. All I know is that the "true" pattern should be of length 30, and that my sequences have had missing values introduced at random positions.

Here is an example of such sequences, where on the left we see the real location of the missing values, and on the right we see the sequence that we are actually able to observe.

My goal is to reconstruct the left column using only the sequences in the right column (based on the fact that many of the letters in each position are the same).

                     Real_sequence           The_sequence_we_see
1   CGCAATACTAAC-AGCTGACTTACGCACCG CGCAATACTAACAGCTGACTTACGCACCG
2   CGCAATACTAGC-AGGTGACTTCC-CT-CG   CGCAATACTAGCAGGTGACTTCCCTCG
3   CGCAATGATCAC--GGTGGCTCCCGGTGCG  CGCAATGATCACGGTGGCTCCCGGTGCG
4   CGCAATACTAACCA-CTAACT--CGCTGCG   CGCAATACTAACCACTAACTCGCTGCG
5   CGCACGGGTAAGAACGTGA-TTACGCTCAG CGCACGGGTAAGAACGTGATTACGCTCAG
6   CGCTATACTAACAA-GTG-CTTAGGC-CTG   CGCTATACTAACAAGTGCTTAGGCCTG
7   CCCA-C-CTAA-ACGGTGACTTACGCTCCG   CCCACCTAAACGGTGACTTACGCTCCG

Here is some code to reproduce the above example:

ATCG <- c("A", "T", "C", "G")
set.seed(40)
original.seq <- sample(ATCG, 30, TRUE)
# 200 copies of the true sequence, one per row
seqS <- matrix(original.seq, 200, 30, byrow = TRUE)

# Overwrite between 1 and `number.of.changes` random positions of x
change.letters <- function(x, number.of.changes = 15, letters.to.change.with = ATCG)
{
    number.of.changes <- sample(seq_len(number.of.changes), 1)
    new.letters <- sample(letters.to.change.with, number.of.changes, TRUE)
    where.to.change.the.letters <- sample(seq_along(x), number.of.changes, FALSE)
    x[where.to.change.the.letters] <- new.letters
    return(x)
}
change.letters(original.seq)
# Same idea, but writes up to 3 "-" characters (the missing values)
insert.missing.values <- function(x) change.letters(x, 3, "-")
insert.missing.values(original.seq)

seqS2 <- t(apply(seqS, 1, change.letters))
seqS3 <- t(apply(seqS2, 1, insert.missing.values))

seqS4 <- apply(seqS3, 1, function(x) {paste(x, collapse = "")})
library(stringr)
# str_replace_all() removes every "-"; str_replace() would remove only the first one
all.seqS <- str_replace_all(seqS4, "-", "")

# how do we align this?
data.frame(Real_sequence = seqS4, The_sequence_we_see = all.seqS)

I understand that if all I had was a string and a pattern I would be able to use

library(Biostrings)
pairwiseAlignment(...)

But in the case I present we are dealing with many sequences to align to one another (instead of aligning them to one pattern).

Is there a known method for doing this in R?

Comments (4)

以酷 2024-10-15 14:17:53

You are looking for a global alignment algorithm on multiple sequences.
Did you look at Wikipedia before asking?

First learn what global alignment is, then look for multiple sequence alignment.

Wikipedia doesn't give a lot of details about algorithms, but this paper is better.
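
To make "global alignment" concrete, here is a minimal base-R sketch of the Needleman-Wunsch dynamic program for two sequences. It is not taken from this thread: the function name `needleman_wunsch` and the scores (match 1, mismatch -1, gap -1) are assumptions chosen for illustration. Progressive aligners such as ClustalW build on pairwise alignments of this kind, so this is only a teaching aid and not a substitute for a real multiple sequence aligner.

```r
# Global (Needleman-Wunsch) alignment of two strings, for illustration only.
# The scoring values are hypothetical defaults, not taken from any package.
needleman_wunsch <- function(a, b, match = 1, mismatch = -1, gap = -1) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # score[i + 1, j + 1] = best score for aligning x[1..i] with y[1..j]
  score <- matrix(0, n + 1, m + 1)
  score[, 1] <- gap * (0:n)
  score[1, ] <- gap * (0:m)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) match else mismatch
      score[i + 1, j + 1] <- max(score[i, j] + s,        # align x[i] with y[j]
                                 score[i, j + 1] + gap,  # gap in b
                                 score[i + 1, j] + gap)  # gap in a
    }
  }

  # Trace back from the bottom-right corner to recover one optimal alignment
  i <- n
  j <- m
  out_a <- character(0)
  out_b <- character(0)
  while (i > 0 || j > 0) {
    s <- if (i > 0 && j > 0 && x[i] == y[j]) match else mismatch
    if (i > 0 && j > 0 && score[i + 1, j + 1] == score[i, j] + s) {
      out_a <- c(x[i], out_a); out_b <- c(y[j], out_b)
      i <- i - 1; j <- j - 1
    } else if (i > 0 && score[i + 1, j + 1] == score[i, j + 1] + gap) {
      out_a <- c(x[i], out_a); out_b <- c("-", out_b)
      i <- i - 1
    } else {
      out_a <- c("-", out_a); out_b <- c(y[j], out_b)
      j <- j - 1
    }
  }

  list(a = paste(out_a, collapse = ""),
       b = paste(out_b, collapse = ""),
       score = score[n + 1, m + 1])
}

# Toy usage: the shorter string receives one gap
res <- needleman_wunsch("GATTACA", "GATACA")
cat(res$a, res$b, sep = "\n")
```

The table has (n + 1) x (m + 1) cells, so this scales quadratically for a single pair; aligning many sequences this way quickly becomes impractical, which is why dedicated tools are recommended elsewhere in this thread.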

任谁 2024-10-15 14:17:52

You can perform multiple alignment in R with the DECIPHER package.

Following your example, it would look something like:

library(DECIPHER)
dna <- DNAStringSet(all.seqS)
aligned_DNA <- AlignSeqs(dna)

It is fast and at least as accurate as the other methods listed here (see the paper). I hope that helps!
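
Once an aligner has returned gapped sequences of equal length, one simple way to estimate the "true" pattern is a column-wise majority vote. The sketch below uses base R only and is not part of DECIPHER; the helper name `consensus_from_alignment` is hypothetical, and it assumes the alignment is available as a character vector (for example via `as.character(aligned_DNA)`). For illustration it is applied to the seven Real_sequence rows from the question rather than to actual aligner output.

```r
# Column-wise majority vote over an alignment given as equal-length strings.
# Gaps ("-") are ignored when voting; ties go to the alphabetically first letter.
consensus_from_alignment <- function(aligned) {
  stopifnot(length(unique(nchar(aligned))) == 1)  # all rows must have equal width
  cols <- do.call(rbind, strsplit(aligned, ""))   # one row per sequence
  votes <- apply(cols, 2, function(column) {
    letters_only <- column[column != "-"]
    if (length(letters_only) == 0) return("-")    # column made only of gaps
    counts <- table(letters_only)
    names(counts)[which.max(counts)]
  })
  paste(votes, collapse = "")
}

# The gapped sequences shown in the question's Real_sequence column
real_seqs <- c(
  "CGCAATACTAAC-AGCTGACTTACGCACCG",
  "CGCAATACTAGC-AGGTGACTTCC-CT-CG",
  "CGCAATGATCAC--GGTGGCTCCCGGTGCG",
  "CGCAATACTAACCA-CTAACT--CGCTGCG",
  "CGCACGGGTAAGAACGTGA-TTACGCTCAG",
  "CGCTATACTAACAA-GTG-CTTAGGC-CTG",
  "CCCA-C-CTAA-ACGGTGACTTACGCTCCG"
)
print(consensus_from_alignment(real_seqs))
```

With only seven noisy rows the vote can still be wrong at some positions; the 200 simulated sequences from the question give it much more to work with.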

深海蓝天 2024-10-15 14:17:51

Writing an alignment algorithm in R looks like a bad idea to me, but there is an R interface to the MUSCLE algorithm in the bio3d package (function seqaln()). Be aware of the fact that you have to install this algorithm first.

Alternatively, you can use any of the available algorithms (e.g. ClustalW, MAFFT, T-COFFEE) and import the multiple sequence alignments into R using Bioconductor functionality. See e.g. here.

北座城市 2024-10-15 14:17:49

Though this is quite an old thread, I do not want to miss the opportunity to mention that, since Bioconductor 3.1, there is a package 'msa' that implements interfaces to three different multiple sequence alignment algorithms: ClustalW, ClustalOmega, and MUSCLE. The package runs on all major platforms (Linux/Unix, Mac OS, and Windows) and is self-contained in the sense that you need not install any external software. More information can be found on http://www.bioinf.jku.at/software/msa/ and http://www.bioconductor.org/packages/release/bioc/html/msa.html.
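
As a rough sketch of how the package might be used on the question's data, the snippet below aligns three of the observed sequences with ClustalW. It assumes `msa` has been installed from Bioconductor (for example with `BiocManager::install("msa")`); the names `s1` to `s3` and the variable names are made up for illustration, and the call is skipped when the package is missing.

```r
# Three of the observed (gap-free) sequences from the question; names are arbitrary
obs <- c(s1 = "CGCAATACTAACAGCTGACTTACGCACCG",
         s2 = "CGCAATACTAGCAGGTGACTTCCCTCG",
         s3 = "CGCAATGATCACGGTGGCTCCCGGTGCG")

if (requireNamespace("msa", quietly = TRUE)) {
  library(msa)  # also attaches Biostrings, which provides DNAStringSet()
  aln <- msa(DNAStringSet(obs), method = "ClustalW")
  print(aln)
  # Gapped sequences as a plain character vector, all of the same width
  aligned <- as.character(unmasked(aln))
  print(aligned)
} else {
  message("Package 'msa' is not installed; skipping the alignment.")
}
```

Setting `method` to `"ClustalOmega"` or `"Muscle"` selects the other two algorithms mentioned above.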
