Get values from one data frame into another data frame

Posted 2024-11-09 04:30:24

I'm new to R programming and have just started learning it, and I need help with this issue.

I have 2 data frames:

The first (df1):

    V1 V2
    A  A 
    A  B 
    A  C 
    B  A 
    B  B 
    B  C 

etc.

The second (df2):

V1  Va   Vb
A   12   23
B   15   53
C   321  543
D   54   325
etc.

Use this code to generate the sample data.

df1 <- data.frame(
  V1 = rep(LETTERS[1:2], each = 3L),
  V2 = rep.int(LETTERS[1:3], 2L)
)
# df2 needs its key column V1; the values follow the table above
df2 <- data.frame(
  V1 = LETTERS[1:4],
  Va = c(12, 15, 321, 54),
  Vb = c(23, 53, 543, 325)
)

I need to take Va and Vb from df2 and place them in df1, looked up once by df1's V1 and once by df1's V2.

So I want this output:

df3:

V1  V2  Va1  Vb1  Va2  Vb2
A   A   12   23   12   23
A   B   12   23   15   53
A   C   12   23   321  543
B   A   15   53   12   23
B   B   15   53   15   53
B   C   15   53   321  543

I hope this can be done in R without a dozen for loops.

Comments (3)

迷途知返 2024-11-16 04:30:24

You can use merge() twice to get what you want. By default, merge() joins on the column names the two data frames have in common (here V1). In the second merge, we specify the columns to join on explicitly:

df1 <- data.frame(V1 = c('A', 'A', 'A', 'B', 'B', 'B'), V2 = c('A', 'B', 'C', 'A', 'B', 'C'))
df2 <- data.frame(V1 = c('A', 'B', 'C', 'D'), Va = c(12, 15, 321, 54), Vb = c(23, 53, 543, 325))

merge(merge(df1, df2), df2, by.x = "V2", by.y = "V1", suffixes = c("1", "2"))
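
One thing to watch with the double merge: merge() puts the key column of the outer merge first and sorts the rows by that key, so the result comes back as V2, V1, Va1, Vb1, Va2, Vb2 rather than in the layout shown in the question. A minimal sketch of putting it back in order, where `res` and `df3` are just names picked for illustration:

```r
df1 <- data.frame(V1 = c("A", "A", "A", "B", "B", "B"),
                  V2 = c("A", "B", "C", "A", "B", "C"))
df2 <- data.frame(V1 = c("A", "B", "C", "D"),
                  Va = c(12, 15, 321, 54),
                  Vb = c(23, 53, 543, 325))

# Double merge as above; the key of the outer merge (V2) becomes column 1.
res <- merge(merge(df1, df2), df2,
             by.x = "V2", by.y = "V1", suffixes = c("1", "2"))

# Restore the row and column order used in the question.
df3 <- res[order(res$V1, res$V2),
           c("V1", "V2", "Va1", "Vb1", "Va2", "Vb2")]
rownames(df3) <- NULL
df3
```

Note that merge() does an inner join by default, so rows of df1 whose key is missing from df2 would be dropped unless you pass all.x = TRUE.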
2024-11-16 04:30:24

You can use the power of match() for this, provided the keys in df2$V1 are unique (match() only returns the first matching position):

# Some sample data
df1 <- data.frame(
  V1 = rep(c("A", "B"), each = 3),
  V2 = rep(LETTERS[1:3], 2)
)
df2 <- data.frame(V1 = LETTERS[1:3], Va = 1:3, Vb = 3:1)

# Look up the df2 row for each V1 and each V2; -1 drops the key column
out <- cbind(df1,
  df2[match(df1$V1, df2$V1), -1],
  df2[match(df1$V2, df2$V1), -1]
)
names(out)[3:6] <- c("Va1", "Vb1", "Va2", "Vb2")
rownames(out) <- NULL  # reset the row names to 1..n

This gives:

> out
    V1 V2 Va1 Vb1 Va2 Vb2
1    A  A   1   3   1   3
2    A  B   1   3   2   2
3    A  C   1   3   3   1
4    B  A   2   2   1   3
5    B  B   2   2   2   2
6    B  C   2   2   3   1
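
In case it helps to see what match() actually hands back, here is a small sketch with made-up vectors (`keys`, `wanted` and `vals` are illustrative names, not part of the data above). It also shows the two edge cases worth knowing about: keys that are not found, and duplicated keys.

```r
keys   <- c("A", "B", "C", "D")   # lookup keys, playing the role of df2$V1
wanted <- c("B", "A", "Z", "B")   # values to look up, like df1$V1

idx <- match(wanted, keys)
idx                               # 2 1 NA 2

# Indexing with NA yields NA, so keys that are not found become NA values.
vals <- c(12, 15, 321, 54)
vals[idx]                         # 15 12 NA 15

# With duplicated keys, only the first position is returned.
match("A", c("A", "B", "A"))      # 1
```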

You have to rename the columns manually, because otherwise you get several columns with the same name. Although that is technically possible in a data frame, it can cause trouble later on. You can automate the renaming with something along these lines:

names(out) <-
    c("V1", "V2",
      # outer() yields Va1, Vb1, Va2, Vb2, matching the column order of out
      outer(names(df2)[2:3], 1:2, paste0)
    )

EDIT: for big data frames, converting to matrices makes another huge difference. Keep in mind that as.matrix() coerces all columns to a single type, so this is only safe when the value columns share a type (here they are all numeric). The speedup comes from the fact that cbind and merge on data frames spend a lot of time figuring out the correct type for each variable.
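
To make the type caveat concrete, here is a small sketch with a made-up data frame (`mixed` and its `label` column are hypothetical, only there to show the coercion):

```r
mixed <- data.frame(V1 = c("A", "B"),
                    Va = c(1.5, 2.5),
                    label = c("x", "y"),
                    stringsAsFactors = FALSE)

# Only numeric columns: the matrix stays numeric.
num_only <- as.matrix(mixed[, "Va", drop = FALSE])
typeof(num_only)     # "double"

# Numeric and character together: everything becomes character.
with_text <- as.matrix(mixed[, c("Va", "label")])
typeof(with_text)    # "character"
```

So the matrix route is fine for all-numeric value columns like Va and Vb, but with mixed columns you would get strings back and have to convert them again.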

With the following data and functions:

n <- 1e5
df1 <- data.frame(V1 = rep(LETTERS,each=n),V2 = rep(LETTERS,n),
        stringsAsFactors=FALSE)
df2 <- data.frame(V1=LETTERS,Va=1:26,Vb=26:1,stringsAsFactors=FALSE)

fast_JM <- function(df1, df2) {
  # Index plain matrices instead of data frames
  out <- cbind(
    as.matrix(df2[, -1])[match(df1$V1, df2$V1), ],
    as.matrix(df2[, -1])[match(df1$V2, df2$V1), ]
  )
  out <- as.data.frame(out)
  # Va1, Vb1, Va2, Vb2: same order as the bound columns
  names(out) <- as.vector(outer(names(df2)[2:3], 1:2, paste0))
  out$V1 <- df1$V1
  out$V2 <- df1$V2
  out
}

slow_JM <- function(df1,df2){
  out <- cbind(df1,
    df2[match(df1$V1,df2$V1),-1],
    df2[match(df1$V2,df2$V1),-1]
  )
  names(out)[3:6] <- c("Va1","Vb1","Va2","Vb2")
  out
}


double_merge <- function(df1, df2) {
  merge(merge(df1, df2), df2, by.x = "V2", by.y = "V1", suffixes = c("1", "2"))
}

The benchmark then gives:

library(rbenchmark)
benchmark(fast_JM(df1,df2),slow_JM(df1,df2),double_merge(df1,df2),
      replications=1,columns=c("test","elapsed","relative"),order="relative")

                    test elapsed relative
1      fast_JM(df1, df2)    0.89  1.00000
2      slow_JM(df1, df2)   12.54 14.08989
3 double_merge(df1, df2)   42.50 47.75281

So the matrix version is more than 40 times faster than the double merge, and more than 10 times faster than the same match() approach on data frames.
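
If the value columns do not all share a type, one possible middle ground is to call match() once per key and then index each value column as a plain vector, which keeps every column's own type. This is only a sketch: `lookup_twice` is a hypothetical helper, it assumes the key columns are named V1 and V2 as in the question, and I have not timed it against the functions above.

```r
lookup_twice <- function(df1, df2, key = "V1") {
  value_cols <- setdiff(names(df2), key)
  idx1 <- match(df1$V1, df2[[key]])   # row of df2 for each df1$V1
  idx2 <- match(df1$V2, df2[[key]])   # row of df2 for each df1$V2
  out <- df1
  # These loops run over the few value columns, not over the rows.
  for (col in value_cols) out[[paste0(col, "1")]] <- df2[[col]][idx1]
  for (col in value_cols) out[[paste0(col, "2")]] <- df2[[col]][idx2]
  out
}

df1 <- data.frame(V1 = rep(c("A", "B"), each = 3),
                  V2 = rep(LETTERS[1:3], 2))
df2 <- data.frame(V1 = LETTERS[1:4],
                  Va = c(12, 15, 321, 54),
                  Vb = c(23, 53, 543, 325))

res <- lookup_twice(df1, df2)
res
```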

风启觞 2024-11-16 04:30:24

I think Hadley Wickham's dplyr package has some great tools for combining dataframes. It's a different way to arrive at the same place.

df1 <- data.frame(
  V1 = rep(LETTERS[1:2], each = 3L),
  V2 = rep.int(LETTERS[1:3], 2L)
)
dfr2 <- data.frame(
  V1 = LETTERS[1:4],
  Va = c(12, 15, 312, 54),
  Vb = c(23, 53, 543, 325)
)

## Necessary libraries (library() attaches one package per call)
library(magrittr)
library(dplyr)

Now, using the left_join() and rename() functions from the dplyr package, and the pipe operator from the magrittr package, I think you can save a lot of keystrokes.

df3 <- df1 %>% # pipe operator (%>%) is from the magrittr package
  dplyr::left_join(dfr2, by = c("V1" = "V1")) %>% # join on V1
  dplyr::rename(Va1 = Va, Vb1 = Vb) %>% # rename columns
  dplyr::left_join(dfr2, by = c("V2" = "V1")) %>% # join df1's V2 to dfr2's V1
  dplyr::rename(Va2 = Va, Vb2 = Vb) # rename again

This can also be done with one merge instead of two, but I like to see what's going on in each step.

> print(df3)
  V1 V2 Va1 Vb1 Va2 Vb2
1  A  A  12  23  12  23
2  A  B  12  23  15  53
3  A  C  12  23 312 543
4  B  A  15  53  12  23
5  B  B  15  53  15  53
6  B  C  15  53 312 543
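
If you would rather skip the two rename() calls, the suffix argument of left_join() can do the numbering for you, because the second join sees Va and Vb on both sides. A minimal sketch, assuming a reasonably recent dplyr and the same sample data as above:

```r
library(dplyr)

df1 <- data.frame(
  V1 = rep(LETTERS[1:2], each = 3L),
  V2 = rep.int(LETTERS[1:3], 2L)
)
dfr2 <- data.frame(
  V1 = LETTERS[1:4],
  Va = c(12, 15, 312, 54),
  Vb = c(23, 53, 543, 325)
)

df3 <- df1 %>%
  left_join(dfr2, by = c("V1" = "V1")) %>%
  # Va and Vb now clash between the two sides, so suffix applies:
  # the existing columns get "1", the newly joined ones get "2".
  left_join(dfr2, by = c("V2" = "V1"), suffix = c("1", "2"))
df3
```

Being a left join, this keeps every row of df1 and fills in NA where a key has no match in dfr2.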