R 中的加入/分组

发布于 2025-01-11 05:43:58 字数 2861 浏览 0 评论 0原文

我有两个这样的数据集: 水果

ID苹果橙子
1011
2100
3110
4001
5100

该数据集表示具有该 ID 的人是否拥有该水果(1)( 0)。这里的ID是主键。

另一个数据集是Juice。该表表示该 ID 在给定日期制作的果汁。该数据集中没有重复项。

ID日期
18/12/2021
16/9/2020
27/14/2020
23/6/2021
25/2/2020
38/31/2021
59/21/2020

我想要的输出 就是知道哪种水果被使用了多少次。如果一个 ID 有超过 1 个水果,请考虑他使用这两种水果来制作果汁。

让我们按列跟踪 - 苹果 - ID 2、ID 3 和 ID 5 有苹果。 ID 2 榨汁 3 次,ID 3 榨汁 1 次,ID 3 榨汁 1 次,所以苹果用了 5 次(3+1+1)。同样,ID 1 和 ID 3 有橙色。 ID 1 榨汁 2 次,ID 3 榨汁 1 次,所以橙子用了 3 次(2+1)。 ID 1 榨汁 2 次,ID 4 榨汁 0 次,所以梨用了 2 次。

水果计数
苹果5
橙子3
2

我希望用 R、Python 或 SQL 实现此目的,但我认为 R 具有解决此问题的最佳函数。我不太确定如何解决这个问题,因为涉及两个表。任何帮助将不胜感激。

I have 2 datasets like this:
Fruits

IDApplesOrangesPears
1011
2100
3110
4001
5100

This dataset represents if a person with that ID has that fruit(1) or not(0). Here ID is the primary key.

Another dataset is Juice. This table represents juice made by that ID on the given date. There are no duplicates in this dataset.

IDDates
18/12/2021
16/9/2020
27/14/2020
23/6/2021
25/2/2020
38/31/2021
59/21/2020

My desired output would be to know which fruit was used how many times. If an Id has more than 1 fruit, consider he used both the fruits to make the juice.

Let's follow column-wise- Apples- ID 2, ID 3 and ID 5 has apples. ID 2 made juice 3 times, ID 3 made juice 1 time and ID 3 made juice 1 time, so apple was used 5 times(3+1+1). Similarly, ID 1 and ID 3 has oranges. ID 1 made juice 2 times and ID 3 made juice 1 time, so orange was used 3 times(2+1). ID 1 made juice 2 times, and ID 4 made juice 0 times, so pear was used 2 times.

FruitCount
Apples5
Oranges3
Pears2

I want this in R, Python or SQL, though I think R has the best functions to approach this problem. I am not really sure how to approach this as there are two tables involved. Any help would be really appreciated.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

一身软味 2025-01-18 05:43:58

R:base

tmp <- lapply(merge(Juice, Fruits, by = "ID", all.left = TRUE)[-(1:2)], sum)
data.frame(Fruit = names(tmp), Count = unlist(tmp, use.names = FALSE))
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

R:dplyr

library(dplyr)
library(tidyr) # pivot_longer
Fruits %>%
  pivot_longer(-ID, names_to = "Fruit") %>%
  right_join(Juice, by = "ID") %>%
  filter(value > 0) %>%
  count(Fruit)
# # A tibble: 3 x 2
#   Fruit       n
#   <chr>   <int>
# 1 Apples      5
# 2 Oranges     3
# 3 Pears       2

R:data.table

library(data.table)
JuiceDT <- as.data.table(Juice)    # canonical: setDT(Juice)
FruitsDT <- as.data.table(Fruits)
melt(JuiceDT[FruitsDT, on = .(ID), nomatch=NULL
             ][, lapply(.SD, sum), .SDcols = c("Apples", "Oranges", "Pears")],
     measure.vars = patterns("."),
     variable.name = "Fruit", value.name = "Count")
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

替代方案,与上面的 dplyr 解决方案更加一致:

melt(FruitsDT, id.vars = "ID", variable.name = "Fruit"
  )[JuiceDT, on = .(ID)
  ][, .(Count = sum(value)), by = Fruit]
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

SQL(通过 R 的 sqldf

sqldf::sqldf(
  "with cte as (select * from Juice j left join Fruits f on j.ID=f.ID)
   select 'Apples' as Fruit, sum(Apples) as Count from cte
   union all
   select 'Oranges' as Fruit, sum(Oranges) as Count from cte
   union all
   select 'Pears' as Fruit, sum(Pears) as Count from cte
")
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

实例使用 SQLite 引擎,该引擎不支持PIVOT。还有其他 sqldf 引擎可能支持它,并且对其他 DBMS 执行“原始 SQL”应该允许人们在其方言中更自然地进行转换。


R数据

Juice <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 5L), Dates = c("8/12/2021", "6/9/2020", "7/14/2020", "3/6/2021", "5/2/2020", "8/31/2021", "9/21/2020")), class = "data.frame", row.names = c(NA, -7L))
Fruits <- structure(list(ID = 1:5, Apples = c(0L, 1L, 1L, 0L, 1L), Oranges = c(1L, 0L, 1L, 0L, 0L), Pears = c(1L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))

R: base

tmp <- lapply(merge(Juice, Fruits, by = "ID", all.left = TRUE)[-(1:2)], sum)
data.frame(Fruit = names(tmp), Count = unlist(tmp, use.names = FALSE))
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

R: dplyr

library(dplyr)
library(tidyr) # pivot_longer
Fruits %>%
  pivot_longer(-ID, names_to = "Fruit") %>%
  right_join(Juice, by = "ID") %>%
  filter(value > 0) %>%
  count(Fruit)
# # A tibble: 3 x 2
#   Fruit       n
#   <chr>   <int>
# 1 Apples      5
# 2 Oranges     3
# 3 Pears       2

R: data.table

library(data.table)
JuiceDT <- as.data.table(Juice)    # canonical: setDT(Juice)
FruitsDT <- as.data.table(Fruits)
melt(JuiceDT[FruitsDT, on = .(ID), nomatch=NULL
             ][, lapply(.SD, sum), .SDcols = c("Apples", "Oranges", "Pears")],
     measure.vars = patterns("."),
     variable.name = "Fruit", value.name = "Count")
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

Alternative, more aligned with the dplyr solution above:

melt(FruitsDT, id.vars = "ID", variable.name = "Fruit"
  )[JuiceDT, on = .(ID)
  ][, .(Count = sum(value)), by = Fruit]
#      Fruit Count
#     <fctr> <int>
# 1:  Apples     5
# 2: Oranges     3
# 3:   Pears     2

SQL (via R's sqldf)

sqldf::sqldf(
  "with cte as (select * from Juice j left join Fruits f on j.ID=f.ID)
   select 'Apples' as Fruit, sum(Apples) as Count from cte
   union all
   select 'Oranges' as Fruit, sum(Oranges) as Count from cte
   union all
   select 'Pears' as Fruit, sum(Pears) as Count from cte
")
#     Fruit Count
# 1  Apples     5
# 2 Oranges     3
# 3   Pears     2

This instance is using the SQLite engine, which does not support PIVOT. There are other sqldf engines that may support it, and doing "raw SQL" to other DBMSes should allow one to pivot more naturally within its dialect.


R data

Juice <- structure(list(ID = c(1L, 1L, 2L, 2L, 2L, 3L, 5L), Dates = c("8/12/2021", "6/9/2020", "7/14/2020", "3/6/2021", "5/2/2020", "8/31/2021", "9/21/2020")), class = "data.frame", row.names = c(NA, -7L))
Fruits <- structure(list(ID = 1:5, Apples = c(0L, 1L, 1L, 0L, 1L), Oranges = c(1L, 0L, 1L, 0L, 0L), Pears = c(1L, 0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA, -5L))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文