R sqldf 计数、分组依据、具有 > 1

发布于 2025-01-20 17:11:58 字数 781 浏览 2 评论 0原文

我有一个有3列的表，有几千条记录，示例如下：

df1 <- data.frame(
  ID = c('V1', 'V1', 'V1', 'V3', 'V3', 'V3', 'V4', 'V5','V5','V5'),
  Category = c('a', 'a', 'a', 'a', 'b', 'b', 'a', 'b', 'c', 'c'),
  Amount   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1))

需要使用sqldf查询按ID和类别对数据进行分组，然后对金额进行求和，并且仅当ID有多个类别时才返回行。结果如下所示：

ID  Category Amount_Sum
V3  a        1
V3  b        2
V5  b        1
V5  c        2

我尝试了以下代码，但它实际上不起作用，还有分区上的另一个 row_number ，它可以工作，但比需要许多额外查询所需的时间要长得多。

df2 <- sqldf::sqldf("
  SELECT   Count(*) [CNT]
           [ID], 
           [Category], 
           SUM([Amount]) [amount]
  FROM     df1
  GROUP BY [ID], 
           [Category]
  Having   Count(*) > 1")

在 R 中使用 sqldf 获取输出的最佳方法是什么？谢谢！

原文

I have a table with 3 columns, and a few thousand records, sample is below:

df1 <- data.frame(
  ID = c('V1', 'V1', 'V1', 'V3', 'V3', 'V3', 'V4', 'V5','V5','V5'),
  Category = c('a', 'a', 'a', 'a', 'b', 'b', 'a', 'b', 'c', 'c'),
  Amount   = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1))

Need to query with sqldf to group the data by ID and Category, then sum the amount, and only return rows when the ID has more than one category. The result would look like below:

ID  Category Amount_Sum
V3  a        1
V3  b        2
V5  b        1
V5  c        2

I tried the following code which doesnt really work, and also another row_number over partition which sort of works but is way longer than it needs to be needing many additional queries.

df2 <- sqldf::sqldf("
  SELECT   Count(*) [CNT]
           [ID], 
           [Category], 
           SUM([Amount]) [amount]
  FROM     df1
  GROUP BY [ID], 
           [Category]
  Having   Count(*) > 1")

What is the best way to get that output with sqldf in R? Thanks!

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

累赘 2025-01-27 17:11:58

您可以添加一个联接，该联接仅限于具有多个类别的 ID：

SELECT t1.ID, t1.Category, t1.Amount_Sum
FROM
(
    SELECT ID, Category, SUM(Amount) AS Amount_Sum
    FROM df1
    GROUP BY ID, Category
) t1
INNER JOIN
(
    SELECT ID
    FROM df1
    GROUP BY ID
    HAVING COUNT(DISTINCT Category) > 1
) t2
    ON t2.ID = t1.ID;

You could add a join which restricts to only IDs having more than one category:

SELECT t1.ID, t1.Category, t1.Amount_Sum
FROM
(
    SELECT ID, Category, SUM(Amount) AS Amount_Sum
    FROM df1
    GROUP BY ID, Category
) t1
INNER JOIN
(
    SELECT ID
    FROM df1
    GROUP BY ID
    HAVING COUNT(DISTINCT Category) > 1
) t2
    ON t2.ID = t1.ID;

回复收藏 0 原文

~没有更多了~