R:在3列中提取值和插入

发布于 2025-01-24 15:48:32 字数 2775 浏览 3 评论 0原文

我有一个大型数据集,例如以下内容,并且我正在尝试根据列国家 /地区将值添加到3列。

Country<-c("Asia","Africa - Benin (Cotonou)",
           "Europe - France (Paris)","Asia - China(Shanghai)", "Europe - United Kingdom (London)", "Europe - France (Orléans)"
           , "Afrique - Togo (Lomé)", "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")

ID<-c(1,2,3,4,5,6,7,8,9)
mydata<-data.frame(ID,Country)


 > mydata
>   ID                          Country         col1     col2     col3 
> 1  1                             Asia
> 2  2         Africa - Benin (Cotonou)
> 3  3          Europe - France (Paris)
> 4  4           Asia - China(Shanghai)
> 5  5 Europe - United Kingdom (London)
> 6  6        Europe - France (Orléans)
> 7  7            Afrique - Togo (Lomé)
> 8  8        Afrique - Sénégal (Dakar)
> 9  9 Asia - Pakistan (Rahim Yar Khan)

我尝试了以下操作,但是我对正则表达式有问题,

library(tidyr)
mydata <- mydata %>% separate(col = "Country", into = c("Col1", "Col2", "Col3"), remove = FALSE, fill = "right")
     

我得到的结果如下:

ID     Country                          Col1           Col2     Col3
 1    Asia                              Asia           <NA>     <NA>
 2    Africa - Benin (Cotonou)          Africa         Benin  Cotonou
 3    Europe - France (Paris)           Europe         France  Paris
 4    Asia - China(Shanghai)            Asia           China   Shanghai
 5    Europe - United Kingdom (London)  Europe         United  Kingdom
 6    Europe - France (Orléans)         Europe         France  Orl
 7    Afrique - Togo (Lomé)             Afrique        Togo      L
 8     Afrique - Sénégal (Dakar)        Afrique        S         n
 9 Asia - Pakistan (Rahim Yar Khan)     Asia           Pakistan   Rahim

第3列,第5,6,7,8和9列中缺少某些零件。

想要的结果是:

ID     Country                          Col1           Col2                Col3
     1    Asia                              Asia           <NA>            <NA>
     2    Africa - Benin (Cotonou)          Africa         Benin            Cotonou
     3    Europe - France (Paris)           Europe         France           Paris
     4    Asia - China(Shanghai)            Asia           China            Shanghai
     5    Europe - United Kingdom (London)  Europe         United Kingdom    London
     6    Europe - France (Orléans)         Europe         France            Orléans
     7    Afrique - Togo (Lomé)             Afrique        Togo              Lomé
     8     Afrique - Sénégal (Dakar)        Afrique        Sénégal           Dakar
     9 Asia - Pakistan (Rahim Yar Khan)     Asia           Pakistan          Rahim Yar Khan

我 关于如何这样做的建议?

I have a large dataset like the following and I'm trying to add value to 3 columns based on column Country.

Country<-c("Asia","Africa - Benin (Cotonou)",
           "Europe - France (Paris)","Asia - China(Shanghai)", "Europe - United Kingdom (London)", "Europe - France (Orléans)"
           , "Afrique - Togo (Lomé)", "Afrique - Sénégal (Dakar)", "Asia - Pakistan (Rahim Yar Khan)")

ID<-c(1,2,3,4,5,6,7,8,9)
mydata<-data.frame(ID,Country)


 > mydata
>   ID                          Country         col1     col2     col3 
> 1  1                             Asia
> 2  2         Africa - Benin (Cotonou)
> 3  3          Europe - France (Paris)
> 4  4           Asia - China(Shanghai)
> 5  5 Europe - United Kingdom (London)
> 6  6        Europe - France (Orléans)
> 7  7            Afrique - Togo (Lomé)
> 8  8        Afrique - Sénégal (Dakar)
> 9  9 Asia - Pakistan (Rahim Yar Khan)

I tried the following but im having issue with the regular expression

library(tidyr)
mydata <- mydata %>% separate(col = "Country", into = c("Col1", "Col2", "Col3"), remove = FALSE, fill = "right")
     

The result that I get is the following:

ID     Country                          Col1           Col2     Col3
 1    Asia                              Asia           <NA>     <NA>
 2    Africa - Benin (Cotonou)          Africa         Benin  Cotonou
 3    Europe - France (Paris)           Europe         France  Paris
 4    Asia - China(Shanghai)            Asia           China   Shanghai
 5    Europe - United Kingdom (London)  Europe         United  Kingdom
 6    Europe - France (Orléans)         Europe         France  Orl
 7    Afrique - Togo (Lomé)             Afrique        Togo      L
 8     Afrique - Sénégal (Dakar)        Afrique        S         n
 9 Asia - Pakistan (Rahim Yar Khan)     Asia           Pakistan   Rahim

Some part are missing in column 3, row 5,6,7,8 and 9.

the result that i want is the following:

ID     Country                          Col1           Col2                Col3
     1    Asia                              Asia           <NA>            <NA>
     2    Africa - Benin (Cotonou)          Africa         Benin            Cotonou
     3    Europe - France (Paris)           Europe         France           Paris
     4    Asia - China(Shanghai)            Asia           China            Shanghai
     5    Europe - United Kingdom (London)  Europe         United Kingdom    London
     6    Europe - France (Orléans)         Europe         France            Orléans
     7    Afrique - Togo (Lomé)             Afrique        Togo              Lomé
     8     Afrique - Sénégal (Dakar)        Afrique        Sénégal           Dakar
     9 Asia - Pakistan (Rahim Yar Khan)     Asia           Pakistan          Rahim Yar Khan

Any suggestion on how to do this?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(5

凉城凉梦凉人心 2025-01-31 15:48:32

这是我的第一个贡献,所以如果我错了,请原谅我。
我这样做了,可能不是最简单的方法,但我想它有效:

mydata %>% 
  separate(col = "Country",
           sep = "[\\(-]",
           into = c("Col1", "Col2", "Col3"),
           remove = FALSE,
           fill = "right") %>% 
  mutate(Col3 = str_remove(Col3, "\\)"))

this is my first contribution so please forgive me if I am wrong.
I did it this way, may not be the easiest way but I guess it worked:

mydata %>% 
  separate(col = "Country",
           sep = "[\\(-]",
           into = c("Col1", "Col2", "Col3"),
           remove = FALSE,
           fill = "right") %>% 
  mutate(Col3 = str_remove(Col3, "\\)"))
十年九夏 2025-01-31 15:48:32

更新:要删除额外的空格,我们可以在代码末尾添加此行:
突变(跨越(start_with(“ col”),str_squish))

我们可以替换第一个分隔符- 然后)
我们有一个隔离器。
后词做单独,最后删除剩余的

library(dplyr)
library(stringr)
library(tidyr)

  ID    col1           col2           col3
1  1    Asia           <NA>           <NA>
2  2  Africa          Benin        Cotonou
3  3  Europe         France          Paris
4  4    Asia          China       Shanghai
5  5  Europe United Kingdom         London
6  6  Europe         France        Orléans
7  7 Afrique           Togo           Lomé
8  8 Afrique        Sénégal          Dakar
9  9    Asia       Pakistan Rahim Yar Khan

Update: to remove the extra spaces we could add this line at the end of the code:
mutate(across(starts_with("col"), str_squish))

We could replace the first separator - by ( then
we get one separtor .
Afterwords do separate and finally remove the remaining )

library(dplyr)
library(stringr)
library(tidyr)

  ID    col1           col2           col3
1  1    Asia           <NA>           <NA>
2  2  Africa          Benin        Cotonou
3  3  Europe         France          Paris
4  4    Asia          China       Shanghai
5  5  Europe United Kingdom         London
6  6  Europe         France        Orléans
7  7 Afrique           Togo           Lomé
8  8 Afrique        Sénégal          Dakar
9  9    Asia       Pakistan Rahim Yar Khan
枫林﹌晚霞¤ 2025-01-31 15:48:32

Tidyr ::独立将基于定界符(默认情况下是任何非alpha-numeric)将文本分开为列,因此默认情况下它在空格上分开。您可以使用额外的参数将所有剩余的文本合并到第三列中:

mydata %>% 
    separate(Country, 
            into = c("Col1", "Col2", "Col3"),
            extra = "merge")
  ID    Col1     Col2             Col3
1  1    Asia     <NA>             <NA>
2  2  Africa    Benin         Cotonou)
3  3  Europe   France           Paris)
4  4    Asia    China        Shanghai)
5  5  Europe   United Kingdom (London)
6  6  Europe   France         Orléans)
7  7 Afrique     Togo            Lomé)
8  8 Afrique  Sénégal           Dakar)
9  9    Asia Pakistan  Rahim Yar Khan)
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1]. 

但是,最后我们会得到不必要的)。您可以通过突变删除它,也可以代替单独使用tidyr :: dricter允许基于REGEX提取:

mydata %>% 
    extract(Country, 
            into = c("Col1", "Col2", "Col3"),
            regex = "([[:alnum:]]+) - ([[:alnum:]]+) ?\\((.*)\\)")
  ID    Col1     Col2           Col3
1  1    <NA>     <NA>           <NA>
2  2  Africa    Benin        Cotonou
3  3  Europe   France          Paris
4  4    Asia    China       Shanghai
5  5    <NA>     <NA>           <NA>
6  6  Europe   France        Orléans
7  7 Afrique     Togo           Lomé
8  8 Afrique  Sénégal          Dakar
9  9    Asia Pakistan Rahim Yar Khan

tidyr::separate will separate text into columns based on delimiter (which is by default any non alpha-numeric) so it separates on spaces by default. You can use the extra argument to merge all the remaining text into the 3rd column like so:

mydata %>% 
    separate(Country, 
            into = c("Col1", "Col2", "Col3"),
            extra = "merge")
  ID    Col1     Col2             Col3
1  1    Asia     <NA>             <NA>
2  2  Africa    Benin         Cotonou)
3  3  Europe   France           Paris)
4  4    Asia    China        Shanghai)
5  5  Europe   United Kingdom (London)
6  6  Europe   France         Orléans)
7  7 Afrique     Togo            Lomé)
8  8 Afrique  Sénégal           Dakar)
9  9    Asia Pakistan  Rahim Yar Khan)
Warning message:
Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1]. 

However, with this we get an unnecessary ) at the end. You can either remove this via a mutate or instead of separate use tidyr::extract that allows extracting based on regex:

mydata %>% 
    extract(Country, 
            into = c("Col1", "Col2", "Col3"),
            regex = "([[:alnum:]]+) - ([[:alnum:]]+) ?\\((.*)\\)")
  ID    Col1     Col2           Col3
1  1    <NA>     <NA>           <NA>
2  2  Africa    Benin        Cotonou
3  3  Europe   France          Paris
4  4    Asia    China       Shanghai
5  5    <NA>     <NA>           <NA>
6  6  Europe   France        Orléans
7  7 Afrique     Togo           Lomé
8  8 Afrique  Sénégal          Dakar
9  9    Asia Pakistan Rahim Yar Khan
擦肩而过的背影 2025-01-31 15:48:32
library(dplyr)
library(tidyr)

mydata %>%
  separate(Country, into = c("col1", "col2", "col3"), '( - | ?\\()', remove = FALSE) %>%
  mutate(col3 = gsub(')', '', col3))

#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
#>   ID                          Country    col1           col2           col3
#> 1  1                             Asia    Asia           <NA>           <NA>
#> 2  2         Africa - Benin (Cotonou)  Africa          Benin        Cotonou
#> 3  3          Europe - France (Paris)  Europe         France          Paris
#> 4  4           Asia - China(Shanghai)    Asia          China       Shanghai
#> 5  5 Europe - United Kingdom (London)  Europe United Kingdom         London
#> 6  6        Europe - France (Orléans)  Europe         France        Orléans
#> 7  7            Afrique - Togo (Lomé) Afrique           Togo           Lomé
#> 8  8        Afrique - Sénégal (Dakar) Afrique        Sénégal          Dakar
#> 9  9 Asia - Pakistan (Rahim Yar Khan)    Asia       Pakistan Rahim Yar Khan
library(dplyr)
library(tidyr)

mydata %>%
  separate(Country, into = c("col1", "col2", "col3"), '( - | ?\\()', remove = FALSE) %>%
  mutate(col3 = gsub(')', '', col3))

#> Warning: Expected 3 pieces. Missing pieces filled with `NA` in 1 rows [1].
#>   ID                          Country    col1           col2           col3
#> 1  1                             Asia    Asia           <NA>           <NA>
#> 2  2         Africa - Benin (Cotonou)  Africa          Benin        Cotonou
#> 3  3          Europe - France (Paris)  Europe         France          Paris
#> 4  4           Asia - China(Shanghai)    Asia          China       Shanghai
#> 5  5 Europe - United Kingdom (London)  Europe United Kingdom         London
#> 6  6        Europe - France (Orléans)  Europe         France        Orléans
#> 7  7            Afrique - Togo (Lomé) Afrique           Togo           Lomé
#> 8  8        Afrique - Sénégal (Dakar) Afrique        Sénégal          Dakar
#> 9  9 Asia - Pakistan (Rahim Yar Khan)    Asia       Pakistan Rahim Yar Khan
绳情 2025-01-31 15:48:32

data.table解决方案:

require(data.table)
setDT(mydata)

splitCountry <- function( c_str ) {
  
  vec <- trimws(unlist(strsplit(as.character(c_str),"[[:punct:]]")))
  col1 <- vec[1]
  col2 <- vec[2]
  col3 <- vec[3]
  
  return(list(col1,
              col2,
              col3))
  
}

mydata[,c('col1','col2','col3'):=splitCountry(Country),by=Country]

A data.table solution:

require(data.table)
setDT(mydata)

splitCountry <- function( c_str ) {
  
  vec <- trimws(unlist(strsplit(as.character(c_str),"[[:punct:]]")))
  col1 <- vec[1]
  col2 <- vec[2]
  col3 <- vec[3]
  
  return(list(col1,
              col2,
              col3))
  
}

mydata[,c('col1','col2','col3'):=splitCountry(Country),by=Country]
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文