R - 将文本字符串解析为多列&提取数据值

发布于 2025-01-16 23:09:26 字数 1439 浏览 0 评论 0原文

我有一个如下所示的大型数据集:

IDScores
1English 3, French 7, Geography 8
2Spanish 7, Classics 4
3Physics5, English 5, PE 7, Art 4

我需要从分数中解析文本字符串将每个科目的分数分成单独的列,并将每个人的分数存储为数据值,如下所示:

ID英语法语地理西班牙经典物理体育艺术
1378-----
2---74---
35----574

我无法手动预定义列,因为完整数据集中有 100 个列。到目前为止,我已经清理了数据以删除不一致的大写,并将每个主题标记配对分成不同的列,如下所示:

df$scores2 <- str_to_lower(df$Scores)
split <- separate(
  df,
  scores2,
  into = paste0("Subject", 1:8),
  sep = "\\,",
  remove = FALSE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
)

我已经查看了有关该主题的多个问题,例如 将不规则文本列拆分为 r 中的多个列,但我找不到列标题和数据的另一种情况值混合在文本字符串中。如何生成所需的全套列,然后填充数据值?

I have a large dataset in the form shown below:

IDScores
1English 3, French 7, Geography 8
2Spanish 7, Classics 4
3Physics 5, English 5, PE 7, Art 4

I need to parse the text string from the Scores column into separate columns for each subject with the scores for each individual stored as the data values, as below:

IDEnglishFrenchGeographySpanishClassicsPhysicsPEArt
1378-----
2---74---
35----574

I cannot manually predefine the columns as there are 100s in the full dataset. So far I have cleaned the data to remove inconsistent capitalisation and separated each subject-mark pairing into a distinct column as follows:

df$scores2 <- str_to_lower(df$Scores)
split <- separate(
  df,
  scores2,
  into = paste0("Subject", 1:8),
  sep = "\\,",
  remove = FALSE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
)

I have looked at multiple questions on the subject, such as Split irregular text column into multiple columns in r, but I cannot find another case where the column titles and data values are mixed in the text string. How can I generate the full set of columns required and then populate the data value?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

作死小能手 2025-01-23 23:09:26

您可以首先strsplitScores列来拆分主题分数对(将在列表中),然后unnest列表列成行。然后将主题-分数对分离SubjectScore列中。最后将数据从“长”格式转换为“宽”格式。

谢谢@G。 Grothendieck 改进了我的代码:)

library(tidyverse)

df %>% 
  separate_rows(Scores, sep = ", ") %>% 
  separate(Scores, sep = " ", into = c("Subject", "Score")) %>% 
  pivot_wider(names_from = "Subject", values_from = "Score")

# A tibble: 3 × 9
     ID English French Geography Spanish Classics Physics PE    Art  
  <int> <chr>   <chr>  <chr>     <chr>   <chr>    <chr>   <chr> <chr>
1     1 3       7      8         NA      NA       NA      NA    NA   
2     2 NA      NA     NA        7       4        NA      NA    NA   
3     3 5       NA     NA        NA      NA       5       7     4    

You can first strsplit the Scores column to split on subject-score pairs (which would be in a list), then unnest the list-column into rows. Then separate the subject-score pairs into Subject and Score columns. Finally transform the data from a "long" format to a "wide" format.

Thanks @G. Grothendieck for improving my code:)

library(tidyverse)

df %>% 
  separate_rows(Scores, sep = ", ") %>% 
  separate(Scores, sep = " ", into = c("Subject", "Score")) %>% 
  pivot_wider(names_from = "Subject", values_from = "Score")

# A tibble: 3 × 9
     ID English French Geography Spanish Classics Physics PE    Art  
  <int> <chr>   <chr>  <chr>     <chr>   <chr>    <chr>   <chr> <chr>
1     1 3       7      8         NA      NA       NA      NA    NA   
2     2 NA      NA     NA        7       4        NA      NA    NA   
3     3 5       NA     NA        NA      NA       5       7     4    
拧巴小姐 2025-01-23 23:09:26

使用 data.table

library(data.table)

setDT(dt)

dt <- dt[, .(class_grade = unlist(str_split(Scores, ", "))), by = ID]
dt[, c("class", "grade") := tstrsplit(class_grade, " ")]
dcast(dt, ID ~ class, value.var = c("grade"), sep = "")

结果

#    ID  Art Classics English French Geography   PE Physics Spanish
# 1:  1 <NA>     <NA>       3      7         8 <NA>    <NA>    <NA>
# 2:  2 <NA>        4    <NA>   <NA>      <NA> <NA>    <NA>       7
# 3:  3    4     <NA>       5   <NA>      <NA>    7       5    <NA>

数据

dt <- structure(list(ID = 1:3, Scores = c("English 3, French 7, Geography 8", 
"Spanish 7, Classics 4", "Physics 5, English 5, PE 7, Art 4")), row.names = c(NA, 
-3L), class = c("data.frame"))

Using data.table

library(data.table)

setDT(dt)

dt <- dt[, .(class_grade = unlist(str_split(Scores, ", "))), by = ID]
dt[, c("class", "grade") := tstrsplit(class_grade, " ")]
dcast(dt, ID ~ class, value.var = c("grade"), sep = "")

Results

#    ID  Art Classics English French Geography   PE Physics Spanish
# 1:  1 <NA>     <NA>       3      7         8 <NA>    <NA>    <NA>
# 2:  2 <NA>        4    <NA>   <NA>      <NA> <NA>    <NA>       7
# 3:  3    4     <NA>       5   <NA>      <NA>    7       5    <NA>

Data

dt <- structure(list(ID = 1:3, Scores = c("English 3, French 7, Geography 8", 
"Spanish 7, Classics 4", "Physics 5, English 5, PE 7, Art 4")), row.names = c(NA, 
-3L), class = c("data.frame"))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文