R 中的加入/分组

发布于 2025-01-11 05:43:58 字数 2861 浏览 0 评论 0原文

我有两个这样的数据集：水果

ID	苹果	橙子	梨
1	0	1	1
2	1	0	0
3	1	1	0
4	0	0	1
5	1	0	0

该数据集表示具有该 ID 的人是否拥有该水果(1)( 0）。这里的ID是主键。

另一个数据集是Juice。该表表示该 ID 在给定日期制作的果汁。该数据集中没有重复项。

ID	日期
1	8/12/2021
1	6/9/2020
2	7/14/2020
2	3/6/2021
2	5/2/2020
3	8/31/2021
5	9/21/2020

我想要的输出就是知道哪种水果被使用了多少次。如果一个 ID 有超过 1 个水果，请考虑他使用这两种水果来制作果汁。

让我们按列跟踪 - 苹果 - ID 2、ID 3 和 ID 5 有苹果。 ID 2 榨汁 3 次，ID 3 榨汁 1 次，ID 3 榨汁 1 次，所以苹果用了 5 次（3+1+1）。同样，ID 1 和 ID 3 有橙色。 ID 1 榨汁 2 次，ID 3 榨汁 1 次，所以橙子用了 3 次（2+1）。 ID 1 榨汁 2 次，ID 4 榨汁 0 次，所以梨用了 2 次。

水果	计数
苹果	5
橙子	3
梨	2

我希望用 R、Python 或 SQL 实现此目的，但我认为 R 具有解决此问题的最佳函数。我不太确定如何解决这个问题，因为涉及两个表。任何帮助将不胜感激。

原文

I have 2 datasets like this:
Fruits

ID	Apples	Oranges	Pears
1	0	1	1
2	1	0	0
3	1	1	0
4	0	0	1
5	1	0	0

This dataset represents if a person with that ID has that fruit(1) or not(0). Here ID is the primary key.

Another dataset is Juice. This table represents juice made by that ID on the given date. There are no duplicates in this dataset.

ID	Dates
1	8/12/2021
1	6/9/2020
2	7/14/2020
2	3/6/2021
2	5/2/2020
3	8/31/2021
5	9/21/2020

My desired output would be to know which fruit was used how many times. If an Id has more than 1 fruit, consider he used both the fruits to make the juice.

Let's follow column-wise- Apples- ID 2, ID 3 and ID 5 has apples. ID 2 made juice 3 times, ID 3 made juice 1 time and ID 3 made juice 1 time, so apple was used 5 times(3+1+1). Similarly, ID 1 and ID 3 has oranges. ID 1 made juice 2 times and ID 3 made juice 1 time, so orange was used 3 times(2+1). ID 1 made juice 2 times, and ID 4 made juice 0 times, so pear was used 2 times.

Fruit	Count
Apples	5
Oranges	3
Pears	2

I want this in R, Python or SQL, though I think R has the best functions to approach this problem. I am not really sure how to approach this as there are two tables involved. Any help would be really appreciated.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

一身软味 2025-01-18 05:43:58

R：base

tmp <- lapply(merge(Juice, Fruits, by = "ID", all.left = TRUE)[-(1:2)], sum)
data.frame(Fruit = names(tmp), Count = unlist(tmp, use.names = FALSE))
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

R：dplyr

library(dplyr)
library(tidyr) # pivot_longer
Fruits %>%
  pivot_longer(-ID, names_to = "Fruit") %>%
  right_join(Juice, by = "ID") %>%
  filter(value > 0) %>%
  count(Fruit)
# # A tibble: 3 x 2
#   Fruit       n
#   <chr>   <int>
# 1 Apples      5
# 2 Oranges     3
# 3 Pears       2

R：data.table

library(data.table)
JuiceDT <- as.data.table(Juice)    # canonical: setDT(Juice)
FruitsDT <- as.data.table(Fruits)
melt(JuiceDT[FruitsDT, on = .(ID), nomatch=NULL
             ][, lapply(.SD, sum), .SDcols = c("Apples", "Oranges", "Pears")],
     measure.vars = patterns("."),
     variable.name = "Fruit", value.name = "Count")
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

替代方案，与上面的 dplyr 解决方案更加一致：

melt(FruitsDT, id.vars = "ID", variable.name = "Fruit"
  )[JuiceDT, on = .(ID)
  ][, .(Count = sum(value)), by = Fruit]
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

SQL（通过 R 的 `sqldf`）

sqldf::sqldf(
  "with cte as (select * from Juice j left join Fruits f on j.ID=f.ID)
   select 'Apples' as Fruit, sum(Apples) as Count from cte
   union all
   select 'Oranges' as Fruit, sum(Oranges) as Count from cte
   union all
   select 'Pears' as Fruit, sum(Pears) as Count from cte
")
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

此实例使用 SQLite 引擎，该引擎不支持PIVOT。还有其他 sqldf 引擎可能支持它，并且对其他 DBMS 执行“原始 SQL”应该允许人们在其方言中更自然地进行转换。

R数据

Juice <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 5L), Dates = c("8/12/2021", "6/9/2020", "7/14/2020", "3/6/2021", "5/2/2020", "8/31/2021", "9/21/2020")), class = "data.frame", row.names = c(NA, -7L))
Fruits <- structure(list(ID = 1:5, Apples = c(0L, 1L, 1L, 0L, 1L), Oranges = c(1L, 0L, 1L, 0L, 0L), Pears = c(1L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))

R: base

tmp <- lapply(merge(Juice, Fruits, by = "ID", all.left = TRUE)[-(1:2)], sum)
data.frame(Fruit = names(tmp), Count = unlist(tmp, use.names = FALSE))
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

R: dplyr

library(dplyr)
library(tidyr) # pivot_longer
Fruits %>%
  pivot_longer(-ID, names_to = "Fruit") %>%
  right_join(Juice, by = "ID") %>%
  filter(value > 0) %>%
  count(Fruit)
# # A tibble: 3 x 2
#   Fruit       n
#   <chr>   <int>
# 1 Apples      5
# 2 Oranges     3
# 3 Pears       2

R: data.table

library(data.table)
JuiceDT <- as.data.table(Juice)    # canonical: setDT(Juice)
FruitsDT <- as.data.table(Fruits)
melt(JuiceDT[FruitsDT, on = .(ID), nomatch=NULL
             ][, lapply(.SD, sum), .SDcols = c("Apples", "Oranges", "Pears")],
     measure.vars = patterns("."),
     variable.name = "Fruit", value.name = "Count")
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

Alternative, more aligned with the dplyr solution above:

melt(FruitsDT, id.vars = "ID", variable.name = "Fruit"
  )[JuiceDT, on = .(ID)
  ][, .(Count = sum(value)), by = Fruit]
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

SQL (via R's `sqldf`)

sqldf::sqldf(
  "with cte as (select * from Juice j left join Fruits f on j.ID=f.ID)
   select 'Apples' as Fruit, sum(Apples) as Count from cte
   union all
   select 'Oranges' as Fruit, sum(Oranges) as Count from cte
   union all
   select 'Pears' as Fruit, sum(Pears) as Count from cte
")
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

This instance is using the SQLite engine, which does not support PIVOT. There are other sqldf engines that may support it, and doing "raw SQL" to other DBMSes should allow one to pivot more naturally within its dialect.

R data

Juice <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 5L), Dates = c("8/12/2021", "6/9/2020", "7/14/2020", "3/6/2021", "5/2/2020", "8/31/2021", "9/21/2020")), class = "data.frame", row.names = c(NA, -7L))
Fruits <- structure(list(ID = 1:5, Apples = c(0L, 1L, 1L, 0L, 1L), Oranges = c(1L, 0L, 1L, 0L, 0L), Pears = c(1L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))

回复收藏 0 原文

~没有更多了~