Grouping data in R (similar to summing with a WHERE clause in SQL)

Posted on 2024-12-02 16:57:38

I am having trouble grouping data in R using the aggregate command. I can do this relatively easily in SQL, but I am not able to get the same effect in R.

As an example, here is a test data set that, in SQL, captures the essence of what I am looking for:

create table #data(v1 varchar(4), v2 int, v3 int, v4 int,v5 int)
insert #data(v1, v2, v3, v4, v5) values(8000, 3, 8, 7, 11)
insert #data(v1, v2, v3, v4, v5) values(8001, 4, 9, 8, 12)
insert #data(v1, v2, v3, v4, v5) values(8002, 5, 10, 9, 13)
insert #data(v1, v2, v3, v4, v5) values(8003, 6, 11, 7, 14)
insert #data(v1, v2, v3, v4, v5) values(8000, 7, 12, 8, 11)
insert #data(v1, v2, v3, v4, v5) values(8001, 3, 13, 9, 12)
insert #data(v1, v2, v3, v4, v5) values(8002, 4, 14, 7, 13)
insert #data(v1, v2, v3, v4, v5) values(8003, 5, 8, 8, 14)
insert #data(v1, v2, v3, v4, v5) values(8000, 6, 9, 9, 11)
insert #data(v1, v2, v3, v4, v5) values(8001, 7, 10, 7, 12)
insert #data(v1, v2, v3, v4, v5) values(8002, 3, 11, 8, 13)
insert #data(v1, v2, v3, v4, v5) values(8003, 4, 12, 9, 14)
insert #data(v1, v2, v3, v4, v5) values(8000, 5, 13, 7, 11)
insert #data(v1, v2, v3, v4, v5) values(8001, 6, 14, 8, 12)
insert #data(v1, v2, v3, v4, v5) values(8002, 7, 8, 9, 13)
insert #data(v1, v2, v3, v4, v5) values(8003, 3, 9, 7, 14)
insert #data(v1, v2, v3, v4, v5) values(8000, 4, 10, 8, 11)
insert #data(v1, v2, v3, v4, v5) values(8001, 5, 11, 9, 12)
insert #data(v1, v2, v3, v4, v5) values(8002, 6, 12, 7, 13)
insert #data(v1, v2, v3, v4, v5) values(8003, 7, 13, 8, 14)
insert #data(v1, v2, v3, v4, v5) values(8000, 3, 14, 9, 11)
insert #data(v1, v2, v3, v4, v5) values(8001, 4, 8, 7, 12)
insert #data(v1, v2, v3, v4, v5) values(8002, 5, 9, 8, 13)
insert #data(v1, v2, v3, v4, v5) values(8003, 6, 10, 9, 14)
insert #data(v1, v2, v3, v4, v5) values(8000, 7, 11, 7, 11)
insert #data(v1, v2, v3, v4, v5) values(8001, 3, 12, 8, 12)
insert #data(v1, v2, v3, v4, v5) values(8002, 4, 13, 9, 13)
insert #data(v1, v2, v3, v4, v5) values(8003, 5, 14, 7, 14)

select * from #data

select v1, sum(v2)
from #data
    where v4 <= v3 and v5 > v3
group by v1

drop table #data

In R, I have tried using aggregate with a subset command, aggregate with an inline function(x)... and still have not been able to coalesce the data in the way I was hoping for in a purely 'R' manner. I am aware of the SQL library to allow SQL statements to be transacted upon a data structure, but I am looking to avoid that since I am pulling the data from a database in the first place, and if I need to resort to that, I might as well write the SQL to do it. I am looking for a purist R way of doing it. Perhaps it is a bit idealistic, but that is the hope and the dream.

Something to note about the clause is that it is something of an exclusionary clause, summing the data NOT included by the V2 variable in the current row. If it were otherwise, this would be a fairly simple problem for me, but I cannot work out whether I should pass the values of the row under scrutiny to a function applied against the larger data set, or whether another solution altogether would be better.

Thanks in advance for the help.

Here is the R code to generate the test data:

m.data <- as.data.frame(
        cbind(8000:8003, 3:7, 8:14, 7:9, 11:14, 1:28),
        row.names=NULL
);

Comments (2)

暗藏城府 2024-12-09 16:57:38

Here's a toy example illustrating what Gavin described in the comments:

dd <- data.frame(v1 = rep(1:4,5),v2 = 1:20, 
                 v3 = runif(20), v4 = runif(20), v5 = runif(20))

#Extract the subset
dd_new <- subset(dd,v4 <= v3 & v5 > v3)

#Using the aggregate command (the numbers below depend on the random draw)...
> aggregate(dd_new$v2,list(v1=dd_new$v1),sum)
  v1  x
1  1 14
2  2 18
3  3 41
4  4 16

#Or the often popular ddply from plyr (load it first with library(plyr))
> ddply(dd_new,.(v1),summarise,tot = sum(v2))
  v1 tot
1  1  14
2  2  18
3  3  41
4  4  16

I split up the subset from the aggregating for clarity, but as Gavin indicated you can roll it all into a single line if you like.
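
For anyone who wants to see the single-line form on the question's own data, here is a minimal sketch. It assumes the 28 SQL rows are rebuilt with lowercase column names (the names `d` and `res` are mine, not from the question), and it uses `rep_len` so the recycling is explicit instead of relying on `cbind` to recycle.

```r
# Rebuild the 28 test rows from the question with explicit recycling
d <- data.frame(
  v1 = rep_len(8000:8003, 28),
  v2 = rep_len(3:7, 28),
  v3 = rep_len(8:14, 28),
  v4 = rep_len(7:9, 28),
  v5 = rep_len(11:14, 28)
)

# subset() plays the role of WHERE, the formula v2 ~ v1 the role of
# "sum(v2) ... group by v1"
res <- aggregate(v2 ~ v1, data = subset(d, v4 <= v3 & v5 > v3), FUN = sum)
res
# Sums per group for this data: 8000 -> 13, 8001 -> 20, 8002 -> 19, 8003 -> 31
```

The filter runs before the grouping, just as WHERE runs before GROUP BY in the SQL version, so rows that fail the condition never reach `sum`.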

つ低調成傷 2024-12-09 16:57:38

And if you are an SQL junkie, try sqldf. It is very effective with large data sets. Note that I used mdata instead of m.data, which would need escaping in SQL.

library(sqldf)
mdata <- as.data.frame(
        cbind(8000:8003, 3:7, 8:14, 7:9, 11:14, 1:28),
        row.names=NULL
);
sqldf("select v1, sum(v2) from mdata  where v4 <= v3 and v5 > v3 group by v1")
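
One detail worth knowing about this data frame: because the `cbind` arguments are unnamed, `as.data.frame` labels the columns V1 to V6 in uppercase. As far as I know the query above still works because SQLite treats identifiers case-insensitively, but base R is case-sensitive. The sketch below is a base R cross-check under that assumption; `keep` and `totals` are illustrative names, and `rep_len` is used only to avoid the recycling warning.

```r
# Same columns as in the answer, recycled explicitly to 28 rows
mdata <- as.data.frame(cbind(
  rep_len(8000:8003, 28), rep_len(3:7, 28), rep_len(8:14, 28),
  rep_len(7:9, 28), rep_len(11:14, 28), 1:28
))

names(mdata)                           # default names: V1 ... V6
names(mdata) <- tolower(names(mdata))  # match the lowercase names in the SQL

# Logical mask standing in for the WHERE clause
keep <- mdata$v4 <= mdata$v3 & mdata$v5 > mdata$v3

# Grouped sum of v2, returned as a vector named by the v1 values
totals <- tapply(mdata$v2[keep], mdata$v1[keep], sum)
totals
```

`tapply` returns a named vector, not a data frame, which is handy for a quick check against the sqldf result.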