How to split a column in R on case, keeping the upper-case letters before the first lower-case letter

Posted on 2025-02-03 06:16:49

I have a large data frame in R with a single column of strings that mix lower-case and upper-case letters.

df1 <- data.frame(a = c('GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'GATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'TCACCACCATCtCATTCTGC', 'ACTGGTTCCAcCAGCGGGTCACGAC'), 
                  stringsAsFactors = FALSE)

I would like the output to keep all of the upper-case letters to the left of any lower-case letters, i.e. something similar to a look-behind feature.

For example:

GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC would become GCCTTGATTTTTTGGCGGGGACCGT
GATTTTTTGGCGGGGACCGTcatGGCGTCGC would become GATTTTTTGGCGGGGACCGT
ACTGGTTCCAcCAGCGGGTCACGAC would become ACTGGTTCCA

I am only interested in the upper-case characters to the left of the first lower-case character. I would also like the code not to fall over if a string contains no lower-case characters.

I have tried looking at "Splitting strings by case", but I cannot seem to adapt it to look behind for upper case.

Thank you in advance for your help.

Comments (4)

罪#恶を代价 2025-02-10 06:16:49

Code:

library(tidyverse)

df1 <- data.frame(a = c('GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'GATTTTTTGGCGGGGACCGTcatGGCGTCGC', 'TCACCACCATCtCATTCTGC', 'ACTGGTTCCAcCAGCGGGTCACGAC', 'BAARA'), 
                  stringsAsFactors = FALSE)
df1

# Extract the first run of two or more upper-case letters (or spaces) from each string
df1$a <- str_trim(str_extract(df1$a, "([:upper:]|[:space:]){2,}"))
df1

Output:

                          a
1 GCCTTGATTTTTTGGCGGGGACCGT
2      GATTTTTTGGCGGGGACCGT
3               TCACCACCATC
4                ACTGGTTCCA
5                     BAARA    # this one has no lower-case character to begin with

To put NA where the string does not contain any lower-case characters:

 # Run this on a freshly created df1: the step above overwrote column a.
 for (i in seq_len(nrow(df1))) {
    if (is.na(str_extract(df1[i, 'a'], "([:lower:]|[:space:]){1,}")))
       {df1[i, 'a'] <- NA}    # no lower-case letter in this row
    else
       {df1[i, 'a'] <- str_trim(str_extract(df1[i, 'a'], "([:upper:]|[:space:]){2,}"))}
     df1[i, 'b'] <- df1[i, 'a']    # copy the result into a new column b
    }
 df1

Output:

                          a                         b
1 GCCTTGATTTTTTGGCGGGGACCGT GCCTTGATTTTTTGGCGGGGACCGT
2      GATTTTTTGGCGGGGACCGT      GATTTTTTGGCGGGGACCGT
3               TCACCACCATC               TCACCACCATC
4                ACTGGTTCCA                ACTGGTTCCA
5                      <NA>                      <NA>

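One caveat with the `{2,}` pattern above: it returns the first run of two or more upper-case letters found anywhere in the string, which is not necessarily the prefix. Below is a minimal base R sketch of an anchored alternative, assuming the goal is strictly the upper-case run at the start of each string. The vector `x` and its last two test strings are made up for illustration and do not come from the question.

```r
# Hypothetical inputs: "AcGGT" and "cATG" are illustrative, not from the question.
x <- c("ACTGGTTCCAcCAGCGGGTCACGAC", "BAARA", "AcGGT", "cATG")

# Anchored at the start: keep the leading run of upper-case letters, drop the rest.
prefix <- sub("^([[:upper:]]*).*$", "\\1", x)

# Optional: NA where the string contains no lower-case letter at all.
prefix[!grepl("[[:lower:]]", x)] <- NA
prefix
```

With this pattern "AcGGT" yields "A" (a single upper-case letter is still a valid prefix) and "cATG" yields an empty string, because nothing upper-case precedes its first lower-case letter.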
小镇女孩 2025-02-10 06:16:49

sub("([A-Z]+)[a-z].*", "\\1", df1$a)

# [1] "GCCTTGATTTTTTGGCGGGGACCGT" "GATTTTTTGGCGGGGACCGT"
# [3] "TCACCACCATC"               "ACTGGTTCCA"

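On the requirement that the code should not fall over when a string has no lower-case letter: `sub()` returns any element that does not match the pattern unchanged, so an all-upper-case string passes through as is. A small sketch, using a made-up two-element vector `x`:

```r
# "BAARA" has no lower-case letter, so the pattern cannot match it.
x <- c("ACTGGTTCCAcCAGCGGGTCACGAC", "BAARA")

res <- sub("([A-Z]+)[a-z].*", "\\1", x)
res
```

Here the first element is cut at its first lower-case letter, while "BAARA" comes back untouched rather than raising an error or producing NA.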
网白 2025-02-10 06:16:49

You can use sub with [a-z].* or [[:lower:]].* to remove the first lower case letter and everything after.

sub("[a-z].*", "", df1$a)
#sub("[[:lower:]].*", "", df1$a) #Alternative
#[1] "GCCTTGATTTTTTGGCGGGGACCGT" "GATTTTTTGGCGGGGACCGT"     
#[3] "TCACCACCATC"               "ACTGGTTCCA"               

Set to NA where there is no lower case:

df1 <- rbind(df1, "ABC")               #Add without lower case
is.na(df1$a) <- !grepl("[a-z]", df1$a) #set NA where no lower case
sub("[a-z].*", "", df1$a)
#[1] "GCCTTGATTTTTTGGCGGGGACCGT" "GATTTTTTGGCGGGGACCGT"     
#[3] "TCACCACCATC"               "ACTGGTTCCA"               
#[5] NA                         
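If the original column should stay intact, the two steps above could be wrapped in a small helper that writes to a new column instead. This is only a sketch: the function name `upper_prefix` and its `na_if_no_lower` argument are hypothetical, and the three-row data frame is a shortened stand-in for the real data.

```r
# Hypothetical helper: name and argument are illustrative, not from the answer.
upper_prefix <- function(x, na_if_no_lower = FALSE) {
  # Remove the first lower-case letter and everything after it.
  out <- sub("[[:lower:]].*", "", x)
  if (na_if_no_lower) {
    # Mark strings that never contained a lower-case letter.
    out[!grepl("[[:lower:]]", x)] <- NA_character_
  }
  out
}

df1 <- data.frame(a = c("TCACCACCATCtCATTCTGC", "ACTGGTTCCAcCAGCGGGTCACGAC", "ABC"),
                  stringsAsFactors = FALSE)
df1$b <- upper_prefix(df1$a, na_if_no_lower = TRUE)
df1
```

Unlike assigning through `is.na(df1$a) <-`, this leaves column `a` as it was and puts the result, including the NA for "ABC", in column `b`.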
不可一世的女人 2025-02-10 06:16:49

You can do it all in one line of code with a positive lookahead regex (matching everything up to the first lower-case letter), so you don't need to handle the NAs separately: str_extract returns NA when there is no match.

stringr::str_extract(df1$a, ".+?(?=[a-z])")

#[1] "GCCTTGATTTTTTGGCGGGGACCGT" "GATTTTTTGGCGGGGACCGT"      "TCACCACCATC"              
#[4] "ACTGGTTCCA"                NA    

To add a new column b with the result as asked in the comments:

df1 |> dplyr::mutate(b = stringr::str_extract(a, ".+?(?=[a-z])"))

#                                      a                         b
# 1 GCCTTGATTTTTTGGCGGGGACCGTcatGGCGTCGC GCCTTGATTTTTTGGCGGGGACCGT
# 2      GATTTTTTGGCGGGGACCGTcatGGCGTCGC      GATTTTTTGGCGGGGACCGT
# 3                 TCACCACCATCtCATTCTGC               TCACCACCATC
# 4            ACTGGTTCCAcCAGCGGGTCACGAC                ACTGGTTCCA
# 5                                BAARA                      <NA>
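If stringr is not available, roughly the same lookahead idea can be sketched in base R with `regexpr(perl = TRUE)` and `regmatches()`. This variant also anchors the pattern with `^` and restricts it to `[A-Z]`, which is an assumption about the intent rather than part of the answer. The test vector `x` is made up for illustration.

```r
# Hypothetical inputs: "caTG" is illustrative, not from the question.
x <- c("GATTTTTTGGCGGGGACCGTcatGGCGTCGC", "BAARA", "caTG")

# Leading upper-case run that is immediately followed by a lower-case letter.
m <- regexpr("^[A-Z]+(?=[a-z])", x, perl = TRUE)

# regexpr() gives -1 where nothing matched; keep NA for those elements.
out <- rep(NA_character_, length(x))
out[m > 0] <- regmatches(x, m)
out
```

The anchor matters for strings that begin with a lower-case letter: the unanchored `.+?(?=[a-z])` would return "c" for "caTG", whereas the anchored pattern gives NA because no upper-case prefix exists.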