R - 将文本字符串解析为多列&提取数据值
我有一个如下所示的大型数据集:
ID | Scores |
---|---|
1 | English 3, French 7, Geography 8 |
2 | Spanish 7, Classics 4 |
3Physics | 5, English 5, PE 7, Art 4 |
我需要从分数中解析文本字符串将每个科目的分数分成单独的列,并将每个人的分数存储为数据值,如下所示:
ID | 英语 | 法语 | 地理 | 西班牙 | 经典 | 物理 | 体育 | 艺术 |
---|---|---|---|---|---|---|---|---|
1 | 3 | 7 | 8 | - | - | - | - | - |
2 | - | - | - | 7 | 4 | - | - | - |
3 | 5 | - | - | - | - | 5 | 7 | 4 |
我无法手动预定义列,因为完整数据集中有 100 个列。到目前为止,我已经清理了数据以删除不一致的大写,并将每个主题标记配对分成不同的列,如下所示:
df$scores2 <- str_to_lower(df$Scores)
split <- separate(
df,
scores2,
into = paste0("Subject", 1:8),
sep = "\\,",
remove = FALSE,
convert = FALSE,
extra = "warn",
fill = "warn",
)
我已经查看了有关该主题的多个问题,例如 将不规则文本列拆分为 r 中的多个列,但我找不到列标题和数据的另一种情况值混合在文本字符串中。如何生成所需的全套列,然后填充数据值?
I have a large dataset in the form shown below:
ID | Scores |
---|---|
1 | English 3, French 7, Geography 8 |
2 | Spanish 7, Classics 4 |
3 | Physics 5, English 5, PE 7, Art 4 |
I need to parse the text string from the Scores column into separate columns for each subject with the scores for each individual stored as the data values, as below:
ID | English | French | Geography | Spanish | Classics | Physics | PE | Art |
---|---|---|---|---|---|---|---|---|
1 | 3 | 7 | 8 | - | - | - | - | - |
2 | - | - | - | 7 | 4 | - | - | - |
3 | 5 | - | - | - | - | 5 | 7 | 4 |
I cannot manually predefine the columns as there are 100s in the full dataset. So far I have cleaned the data to remove inconsistent capitalisation and separated each subject-mark pairing into a distinct column as follows:
df$scores2 <- str_to_lower(df$Scores)
split <- separate(
df,
scores2,
into = paste0("Subject", 1:8),
sep = "\\,",
remove = FALSE,
convert = FALSE,
extra = "warn",
fill = "warn",
)
I have looked at multiple questions on the subject, such as Split irregular text column into multiple columns in r, but I cannot find another case where the column titles and data values are mixed in the text string. How can I generate the full set of columns required and then populate the data value?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您可以首先
strsplit
Scores
列来拆分主题分数对(将在列表中),然后unnest
列表列成行。然后将主题-分数对分离
到Subject
和Score
列中。最后将数据从“长”格式转换为“宽”格式。谢谢@G。 Grothendieck 改进了我的代码:)
You can first
strsplit
theScores
column to split on subject-score pairs (which would be in a list), thenunnest
the list-column into rows. Thenseparate
the subject-score pairs intoSubject
andScore
columns. Finally transform the data from a "long" format to a "wide" format.Thanks @G. Grothendieck for improving my code:)
使用 data.table
结果
数据
Using data.table
Results
Data