R: merge columns by a common ID with no NA on any row?
Given data with missing values, imputation is the process of substituting the missing values with some other values. The goal here is instead to ignore the rows that contain missing values, denoted by NA. Such a row can be seen as a component of the data, hence the process is called item imputation.
Input
df1 <- data.frame(ID=c(1,2,5,6),V1=c(7,77,777,NA))
df2 <- data.frame(ID=c(1,3,5,6),V2=c(6,66,666,6666))
df3 <- data.frame(ID=c(1,3,5,6),V3=c(9,NA,999,9999))
or alternatively in CSV format, where missing values are marked with NA
data.csv    data2.csv    data3.csv
ID  V1      ID  V2       ID  V3
1   7       1   6        1   9
2   77      2   NA       2   NA
3   NA      3   66       3   NA
4   NA      4   NA       4   NA
5   777     5   666      5   999
6   NA      6   6666     6   9999
Output
Expected result is
ID V1 V2 V3
1 7 6 9
5 777 666 999
where we wanted just the rows without any NA value.
How can I merge the input data, with columns V1, V2, V3 and a common column ID, so that no row contains an NA?
Example solution with SQLDF to merge the columns with common ID and no NA
library(sqldf)
# Read in the data: with CSV, you can use read.csv or fread from data.table
df1 <- data.frame(ID=c(1,2,5,6),V1=c(7,77,777,NA))
df2 <- data.frame(ID=c(1,3,5,6),V2=c(6,66,666,6666))
df3 <- data.frame(ID=c(1,3,5,6),V3=c(9,NA,999,9999))
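# sqldf passes NA to SQLite as NULL, so test every value column with IS NOT NULL
sqldf("SELECT a.ID, a.V1, b.V2, c.V3 FROM df1 a, df2 b, df3 c WHERE a.ID=b.ID AND b.ID=c.ID AND a.V1 IS NOT NULL AND b.V2 IS NOT NULL AND c.V3 IS NOT NULL")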
resulting in
ID V1 V2 V3
1 1 7 6 9
2 5 777 666 999
Here is a base R only version that doesn't care how many merges there are. It assumes the data frames are in a list l (see the edit to the question for the example data in that format), and it gives the desired output.
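The code for this answer did not survive on this page, so what follows is only a minimal sketch of one way to do it in base R, not necessarily the answerer's code. It assumes the list l holds the three data frames from the question, folds merge() over the list with Reduce() so the number of data frames does not matter, and then drops incomplete rows with na.omit().

```r
# Example data from the question, collected in a list called l
df1 <- data.frame(ID = c(1, 2, 5, 6), V1 = c(7, 77, 777, NA))
df2 <- data.frame(ID = c(1, 3, 5, 6), V2 = c(6, 66, 666, 6666))
df3 <- data.frame(ID = c(1, 3, 5, 6), V3 = c(9, NA, 999, 9999))
l <- list(df1, df2, df3)

# Fold merge() over the list: each step is an inner join on ID,
# so only IDs present in every data frame survive
merged <- Reduce(function(x, y) merge(x, y, by = "ID"), l)

# Drop every row that still contains an NA in any column
result <- na.omit(merged)
print(result)
```

With the example data, only the rows for ID 1 and ID 5 remain, which matches the expected result in the question.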
To read the data in, load each file into a data frame (with a CSV reader if the files really are CSV) and collect the data frames in the list l; you can then process them as described above.
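The reading code is also missing here, so this is a hedged sketch rather than the original. It assumes the three file names shown in the question, writes small whitespace-separated copies of the example data to a temporary directory so that it runs on its own, and reads them into a list with lapply().

```r
# Write example files to a temporary directory (illustration only;
# in practice the files would already exist on disk)
files <- file.path(tempdir(), c("data.csv", "data2.csv", "data3.csv"))
writeLines(c("ID V1", "1 7", "2 77", "5 777", "6 NA"), files[1])
writeLines(c("ID V2", "1 6", "3 66", "5 666", "6 6666"), files[2])
writeLines(c("ID V3", "1 9", "3 NA", "5 999", "6 9999"), files[3])

# Whitespace-separated files: read.table() treats the string "NA" as missing by default
l <- lapply(files, read.table, header = TRUE)

# If the files really are comma-separated, read.csv() would be used instead:
# l <- lapply(files, read.csv)
str(l)
```

The resulting list l has one data frame per file, ready to be merged on ID.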
Out of respect for the art form of haiku on fine display in this question, I'm going to provide the following answer/wild guess:
Note this assumes you actually want ID==1.
Thanks to @Joris for the merge_recurse tip.
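The code for this answer is not preserved on this page either. As a rough sketch of what a merge_recurse-based approach could look like, the example below assumes merge_recurse() from the reshape package and falls back to a base R outer merge when that package is not installed. Either way, na.omit() then removes the incomplete rows.

```r
df1 <- data.frame(ID = c(1, 2, 5, 6), V1 = c(7, 77, 777, NA))
df2 <- data.frame(ID = c(1, 3, 5, 6), V2 = c(6, 66, 666, 6666))
df3 <- data.frame(ID = c(1, 3, 5, 6), V3 = c(9, NA, 999, 9999))
dfs <- list(df1, df2, df3)

if (requireNamespace("reshape", quietly = TRUE)) {
  # Assumption: the reshape package is available; it merges on the shared column ID
  merged <- reshape::merge_recurse(dfs)
} else {
  # Base R fallback: outer merge of all data frames on their common column
  merged <- Reduce(function(x, y) merge(x, y, all = TRUE), dfs)
}

# Keep only complete rows and put them in ID order
result <- na.omit(merged)
result <- result[order(result$ID), ]
print(result)
```

With the example data this keeps the rows for ID 1 and ID 5.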