Get the third word of each element in a character vector

Posted on 2025-02-07 14:29:18

I have the following character vector called strains:

 head(strains, 10)

 [1] "Lactobacillus gasseri APC678"                    "Lactobacillus gasseri DSM 20243"                
 [3] "Bifidobacterium angulatum B677"                  "Bifidobacterium breve Reuter S1"                
 [5] "Lactobacillus reuteri F275"                      "Lactobacillus acidophilus L917"                 
 [7] "Lactobacillus acidophilus 4357"                  "Bifidobacterium pseudocatenulatum B1279"        
 [9] "Bifidobacterium longum subsp. infantis JCM 1210" "Clostridium difficile 43594"  

What I want is a vector containing just the 3rd word of each element of strains. For example, from the element "Lactobacillus gasseri APC678", I would like to keep only "APC678".

Here is what I did:

library(tidyverse)

lapply(strains %>% str_split(" "), '[', 3) %>% unlist 

This does what I want, as the output of my code shows:

 [1] "APC678" "DSM"    "B677"   "Reuter" "F275"   "L917"   "4357"   "B1279"  "subsp." "43594"  "subsp." "F275"   "1SL4"   "JCM"   
[15] "JCM"    "AM63"   "DSM"    "L917"   "61D"    "Bb14"   "AM63"   "VPI"

However, I'm looking for a more elegant or concise way to do the same, maybe using regex or something alike.


Here is the dput of my data:

strains <- c("Lactobacillus gasseri APC678", "Lactobacillus gasseri DSM 20243", 
"Bifidobacterium angulatum B677", "Bifidobacterium breve Reuter S1", 
"Lactobacillus reuteri F275", "Lactobacillus acidophilus L917", 
"Lactobacillus acidophilus 4357", "Bifidobacterium pseudocatenulatum B1279", 
"Bifidobacterium longum subsp. infantis JCM 1210", "Clostridium difficile 43594"
)

Comments (4)

宣告ˉ结束 2025-02-14 14:29:18

The stringr package has a simple word() function for this, with no regex needed:

library(stringr)

stringr::word(strains, start = 3, end = 3)
 [1] "APC678" "DSM"    "B677"   "Reuter" "F275"   "L917"   "4357"  
 [8] "B1279"  "subsp." "43594" 
软糯酥胸 2025-02-14 14:29:18

You can use the stringr package:

stringr::str_split(strains, " ", simplify = TRUE)[,3]
仅冇旳回忆 2025-02-14 14:29:18

With base R and regex:

sub("^(\\S+\\s){2}(\\S+).*", "\\2", strains)

With data.table:

data.table::tstrsplit(strains, " ")[[3]]
# [1] "APC678" "DSM"    "B677"   "Reuter" "F275"   "L917"   "4357"   "B1279"  "subsp." "43594"
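
One difference between the two approaches only shows up when an element has fewer than three words. The sketch below uses base R only; the two-word entry is a hypothetical addition that is not in the original data, and the vapply()/strsplit() call stands in for the split-based approach rather than reproducing tstrsplit() itself.

```r
# Two of the example strains plus a hypothetical two-word entry
# (the last element is an assumption, not part of the original data).
x <- c("Lactobacillus gasseri APC678",
       "Bifidobacterium longum subsp. infantis JCM 1210",
       "Lactobacillus gasseri")

# Regex approach: when the pattern does not match, sub() returns
# the input string unchanged.
by_regex <- sub("^(\\S+\\s){2}(\\S+).*", "\\2", x)

# Split approach: indexing past the end of a character vector
# yields NA, so short elements become NA.
by_split <- vapply(strsplit(x, " ", fixed = TRUE),
                   function(words) words[3],
                   character(1))

print(by_regex)
print(by_split)
```

If the real data may contain short entries, the split-based result is arguably easier to check afterwards, since the missing values are explicit NAs rather than untouched strings.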
残月升风 2025-02-14 14:29:18

Another possible solution, based on stringr::str_match and capture groups:

library(stringr)

str_match(strains, "(\\S+\\s){2}(\\S+).*")[,3]

#>  [1] "APC678" "DSM"    "B677"   "Reuter" "F275"   "L917"   "4357"   "B1279" 
#>  [9] "subsp." "43594"
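
The [,3] index is used because str_match() returns a matrix whose first column is the full match and whose remaining columns are the capture groups, so the second group (the third word) sits in column 3. As a rough illustration of the same layout without stringr, here is a base R sketch using regmatches() and regexec() on a single strain; the variable names are made up for the example.

```r
# One strain from the example data.
x <- "Bifidobacterium longum subsp. infantis JCM 1210"

# Same pattern as above: two "word + whitespace" units, then the
# third word in its own capture group.
pattern <- "(\\S+\\s){2}(\\S+).*"

# regexec() records the positions of the full match and of each
# capture group; regmatches() extracts them as a character vector.
parts <- regmatches(x, regexec(pattern, x))[[1]]

# parts[1] is the full match, parts[2] the first group,
# parts[3] the second group, i.e. the third word.
third_word <- parts[3]
print(third_word)
```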