r dna-sequence string-algorithm sequence-alignment

How to do multiple sequence alignment of text strings (UTF-8) in R

Posted on 2025-01-31 23:02:43

Given five strings:

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

I would like to do multiple sequence alignment so that I get the following result:

abcd
 bcde
  cdef
a    f
  cd  ghi

Using the msa() function from the msa package I tried:

msa(seq, type = "protein", order = "input", method = "Muscle")

and got the following result:

    aln     names
 [1] ABCD--- Seq1
 [2] -BCDE-- Seq2
 [3] --CD-EF Seq3
 [4] -----AF Seq4
 [5] --CDGHI Seq5
 Con --CD-?? Consensus

I would like to use this function for sequences that can contain arbitrary Unicode characters, but even in this ASCII-only example it gives a warning: invalid letters found. Any ideas?

燕归巢 2025-02-07 23:02:43

Here's a solution in base R that outputs a table:

seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

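# Every distinct character across all strings, in order of first appearance
all_chars <- unique(unlist(strsplit(seq, "")))

# One row per string, one column per distinct character: show the character
# where it occurs in that string (x > 0 also covers repeats), else a blank
tab <- t(apply(do.call(rbind, lapply(strsplit(seq, ""),
       function(x) table(factor(x, all_chars)))), 1,
       function(x) ifelse(x > 0, all_chars, " ")))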

We can print the output without quotes to see it more clearly:

print(tab, quote = FALSE)
#>      a b c d e f g h i
#> [1,] a b c d          
#> [2,]   b c d e        
#> [3,]     c d e f      
#> [4,] a         f      
#> [5,]     c d     g h i

Created on 2022-05-25 by the reprex package (v2.0.1)
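
If the goal is the padded strings from the question rather than a table, the same idea can be collapsed row by row. This is only a sketch of that variation, not a true alignment: it assumes every string lists its characters in the same relative order as their first appearance across the input, and that no character needs more than one column. The names `chars` and `aligned` are introduced here for illustration.

```r
seq <- c("abcd", "bcde", "cdef", "af", "cdghi")

# Split each string into characters (strsplit works per character on UTF-8 input)
chars <- strsplit(seq, "")
all_chars <- unique(unlist(chars))

# One row per string, one column per distinct character;
# keep the character where it is present, a blank otherwise
tab <- t(sapply(chars, function(x) ifelse(all_chars %in% x, all_chars, " ")))
colnames(tab) <- all_chars

# Collapse each row into one padded string
aligned <- apply(tab, 1, paste, collapse = "")

# Drop trailing padding before printing
writeLines(sub(" +$", "", aligned))
#> abcd
#>  bcde
#>   cdef
#> a    f
#>   cd  ghi
```

For input where the character order differs between strings, or where a character repeats within a string, this layout no longer reflects the sequences and a real alignment algorithm is needed.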

指尖上的星空 2025-02-07 23:02:43

One solution is to call LingPy, a Python library, from R through reticulate. First install LingPy according to the instructions at: http://lingpy.org/tutorial/installation.html. Then run:

library(reticulate)

builtins <- import_builtins()
lingpy   <- import("lingpy")

seqs <- c("mɪlk","mɔˑlkə","mɛˑlək","mɪlɪx","mɑˑlʲk")

multi <- lingpy$Multiple(seqs)
multi$prog_align()
builtins$print(multi)

Output:

m   ɪ   l   -   k   -
m   ɔˑ  l   -   k   ə
m   ɛˑ  l   ə   k   -
m   ɪ   l   ɪ   x   -
m   ɑˑ  lʲ  -   k   -
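
If installing Python is not an option, a pairwise alignment of two UTF-8 strings can be sketched in base R from the edit script that `utils::adist()` returns with `counts = TRUE`. This is a swapped-in technique, not what LingPy does: it aligns only two strings at a time, uses plain edit distance, and splits by code point, so a modifier such as `ˑ` becomes its own column instead of staying attached to its vowel. The helper name `align_pair` is hypothetical.

```r
# Pairwise alignment of two strings from the edit script of utils::adist().
# align_pair() is a hypothetical helper, not part of any package.
align_pair <- function(x, y, gap = "-") {
  d <- adist(x, y, counts = TRUE)
  # "trafos" holds the edit script that turns x into y:
  # M = match, S = substitution, I = insertion (character only in y),
  # D = deletion (character only in x)
  ops <- strsplit(attr(d, "trafos")[1, 1], "")[[1]]
  xs <- strsplit(x, "")[[1]]
  ys <- strsplit(y, "")[[1]]
  ax <- character(length(ops))
  ay <- character(length(ops))
  i <- 0L
  j <- 0L
  for (k in seq_along(ops)) {
    if (ops[k] == "I") {
      ax[k] <- gap            # x has no character in this column
    } else {
      i <- i + 1L
      ax[k] <- xs[i]
    }
    if (ops[k] == "D") {
      ay[k] <- gap            # y has no character in this column
    } else {
      j <- j + 1L
      ay[k] <- ys[j]
    }
  }
  c(paste(ax, collapse = ""), paste(ay, collapse = ""))
}

# Unicode escapes for "mɪlk" and "mɪlɪx" keep the script encoding-independent
res <- align_pair("m\u026alk", "m\u026al\u026ax")
writeLines(res)
```

Both returned strings have the same number of characters, and removing the gap symbol gives back the inputs. Extending this to more than two strings would need a progressive or similar strategy on top, which is what a dedicated library provides.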
