R - 将文本字符串解析为多列&提取数据值

发布于 2025-01-16 23:09:26 字数 1439 浏览 0 评论 0原文

我有一个如下所示的大型数据集：

ID	Scores
1	English 3, French 7, Geography 8
2	Spanish 7, Classics 4
3Physics	5, English 5, PE 7, Art 4

我需要从分数中解析文本字符串将每个科目的分数分成单独的列，并将每个人的分数存储为数据值，如下所示：

ID	英语	法语	地理	西班牙	经典	物理	体育	艺术
1	3	7	8	-	-	-	-	-
2	-	-	-	7	4	-	-	-
3	5	-	-	-	-	5	7	4

我无法手动预定义列，因为完整数据集中有 100 个列。到目前为止，我已经清理了数据以删除不一致的大写，并将每个主题标记配对分成不同的列，如下所示：

df$scores2 <- str_to_lower(df$Scores)
split <- separate(
  df,
  scores2,
  into = paste0("Subject", 1:8),
  sep = "\\,",
  remove = FALSE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
)

我已经查看了有关该主题的多个问题，例如将不规则文本列拆分为 r 中的多个列，但我找不到列标题和数据的另一种情况值混合在文本字符串中。如何生成所需的全套列，然后填充数据值？

原文

I have a large dataset in the form shown below:

ID	Scores
1	English 3, French 7, Geography 8
2	Spanish 7, Classics 4
3	Physics 5, English 5, PE 7, Art 4

I need to parse the text string from the Scores column into separate columns for each subject with the scores for each individual stored as the data values, as below:

ID	English	French	Geography	Spanish	Classics	Physics	PE	Art
1	3	7	8	-	-	-	-	-
2	-	-	-	7	4	-	-	-
3	5	-	-	-	-	5	7	4

I cannot manually predefine the columns as there are 100s in the full dataset. So far I have cleaned the data to remove inconsistent capitalisation and separated each subject-mark pairing into a distinct column as follows:

df$scores2 <- str_to_lower(df$Scores)
split <- separate(
  df,
  scores2,
  into = paste0("Subject", 1:8),
  sep = "\\,",
  remove = FALSE,
  convert = FALSE,
  extra = "warn",
  fill = "warn",
)

I have looked at multiple questions on the subject, such as Split irregular text column into multiple columns in r, but I cannot find another case where the column titles and data values are mixed in the text string. How can I generate the full set of columns required and then populate the data value?

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

作死小能手 2025-01-23 23:09:26

您可以首先strsplitScores列来拆分主题分数对（将在列表中），然后unnest列表列成行。然后将主题-分数对分离到Subject和Score列中。最后将数据从“长”格式转换为“宽”格式。

谢谢@G。 Grothendieck 改进了我的代码:)

library(tidyverse)

df %>% 
  separate_rows(Scores, sep = ", ") %>% 
  separate(Scores, sep = " ", into = c("Subject", "Score")) %>% 
  pivot_wider(names_from = "Subject", values_from = "Score")

# A tibble: 3 × 9
     ID English French Geography Spanish Classics Physics PE    Art  
  <int> <chr>   <chr>  <chr>     <chr>   <chr>    <chr>   <chr> <chr>
1     1 3       7      8         NA      NA       NA      NA    NA   
2     2 NA      NA     NA        7       4        NA      NA    NA   
3     3 5       NA     NA        NA      NA       5       7     4

You can first strsplit the Scores column to split on subject-score pairs (which would be in a list), then unnest the list-column into rows. Then separate the subject-score pairs into Subject and Score columns. Finally transform the data from a "long" format to a "wide" format.

Thanks @G. Grothendieck for improving my code:)

library(tidyverse)

df %>% 
  separate_rows(Scores, sep = ", ") %>% 
  separate(Scores, sep = " ", into = c("Subject", "Score")) %>% 
  pivot_wider(names_from = "Subject", values_from = "Score")

# A tibble: 3 × 9
     ID English French Geography Spanish Classics Physics PE    Art  
  <int> <chr>   <chr>  <chr>     <chr>   <chr>    <chr>   <chr> <chr>
1     1 3       7      8         NA      NA       NA      NA    NA   
2     2 NA      NA     NA        7       4        NA      NA    NA   
3     3 5       NA     NA        NA      NA       5       7     4

回复收藏 0 原文

拧巴小姐 2025-01-23 23:09:26

使用 data.table

library(data.table)

setDT(dt)

dt <- dt[, .(class_grade = unlist(str_split(Scores, ", "))), by = ID]
dt[, c("class", "grade") := tstrsplit(class_grade, " ")]
dcast(dt, ID ~ class, value.var = c("grade"), sep = "")

结果

#    ID  Art Classics English French Geography   PE Physics Spanish
# 1:  1 <NA>     <NA>       3      7         8 <NA>    <NA>    <NA>
# 2:  2 <NA>        4    <NA>   <NA>      <NA> <NA>    <NA>       7
# 3:  3    4     <NA>       5   <NA>      <NA>    7       5    <NA>

数据

dt <- structure(list(ID = 1:3, Scores = c("English 3, French 7, Geography 8", 
"Spanish 7, Classics 4", "Physics 5, English 5, PE 7, Art 4")), row.names = c(NA, 
-3L), class = c("data.frame"))

Using data.table

library(data.table)

setDT(dt)

dt <- dt[, .(class_grade = unlist(str_split(Scores, ", "))), by = ID]
dt[, c("class", "grade") := tstrsplit(class_grade, " ")]
dcast(dt, ID ~ class, value.var = c("grade"), sep = "")

Results

#    ID  Art Classics English French Geography   PE Physics Spanish
# 1:  1 <NA>     <NA>       3      7         8 <NA>    <NA>    <NA>
# 2:  2 <NA>        4    <NA>   <NA>      <NA> <NA>    <NA>       7
# 3:  3    4     <NA>       5   <NA>      <NA>    7       5    <NA>

Data

dt <- structure(list(ID = 1:3, Scores = c("English 3, French 7, Geography 8", 
"Spanish 7, Classics 4", "Physics 5, English 5, PE 7, Art 4")), row.names = c(NA, 
-3L), class = c("data.frame"))

回复收藏 0 原文

~没有更多了~