Calculating subtotals in R
I have a data frame with 900,000 rows and 11 columns in R. The column names and types are as follows:
column name: date / mcode / mname / ycode / yname / yissue / bsent / breturn / tsent / treturn / csales
type: Date / Char / Char / Char / Char / Numeric / Numeric / Numeric / Numeric / Numeric / Numeric
I want to calculate subtotals: at each change in yname, I want the sum of every numeric variable. There are 160 distinct ynames, so the resulting table should give me the subtotal for each yname. I haven't sorted the data yet, but that is not a problem because I can sort it whichever way I want. Below is a snippet of my data:
date mcode mname ycode yname yissue bsent breturn tsent treturn csales
417572 2010-07-28 45740 ENDPOINT A 5772 XMAG 20100800 7 0 7 0 0
417573 2010-07-31 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0 0 1
417574 2010-08-04 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0 0 1
417575 2010-08-14 45740 ENDPOINT A 5772 XMAG 20100800 0 0 0 0 1
417576 2010-08-26 45740 ENDPOINT A 5772 XMAG 20100800 0 4 0 0 0
417577 2010-07-28 45741 ENDPOINT L 5772 XMAG 20100800 2 0 2 0 0
417578 2010-08-04 45741 ENDPOINT L 5772 XMAG 20100800 2 0 2 0 0
417579 2010-08-26 45741 ENDPOINT L 5772 XMAG 20100800 0 4 0 0 0
417580 2010-07-28 46390 ENDPOINT R 5772 XMAG 20100800 3 0 3 0 1
417581 2010-07-29 46390 ENDPOINT R 5772 XMAG 20100800 0 0 0 0 2
417582 2010-08-01 46390 ENDPOINT R 5779 YMAG 20100800 3 0 3 0 0
417583 2010-08-11 46390 ENDPOINT R 5779 YMAG 20100800 0 0 0 0 1
417584 2010-08-20 46390 ENDPOINT R 5779 YMAG 20100800 0 0 0 0 1
417585 2010-08-24 46390 ENDPOINT R 5779 YMAG 20100800 2 0 2 0 1
417586 2010-08-26 46390 ENDPOINT R 5779 YMAG 20100800 0 2 0 2 0
417587 2010-07-28 46411 ENDPOINT D 5779 YMAG 20100800 6 0 6 0 0
417588 2010-08-08 46411 ENDPOINT D 5779 YMAG 20100800 0 0 0 0 1
417589 2010-08-11 46411 ENDPOINT D 5779 YMAG 20100800 0 0 0 0 1
417590 2010-08-26 46411 ENDPOINT D 5779 YMAG 20100800 0 4 0 4 0
What function should I use here? Maybe something like SQL's GROUP BY?
OK. Assuming your data are in a data frame named foo, aggregate() will sum the numeric columns in your data for each yname.
I tried this on the snippet of data you included in your question, using the formula interface to aggregate(), which is a bit nicer in this instance because you don't need all the foo$ bits on the names of the variables you wish to aggregate. If your full data set has missing data (NA), you will need to add the extra argument na.rm = TRUE, which gets passed on to sum().
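To make the formula-interface call concrete, here is a minimal sketch rather than the answer's original code. It assumes foo holds only yname and the five count columns, copied from the snippet in the question; the other columns are left out for brevity. The na.action = na.pass argument in the second call is my own addition: as far as I know, the formula method defaults to na.action = na.omit, which drops every row containing an NA before sum() ever sees it, so na.rm = TRUE on its own would also discard the non-missing values in that row.

```r
# Numeric columns of the 19-row snippet from the question (rows 417572-417590).
foo <- data.frame(
  yname   = rep(c("XMAG", "YMAG"), times = c(10, 9)),
  bsent   = c(7, 0, 0, 0, 0, 2, 2, 0, 3, 0,  3, 0, 0, 2, 0, 6, 0, 0, 0),
  breturn = c(0, 0, 0, 0, 4, 0, 0, 4, 0, 0,  0, 0, 0, 0, 2, 0, 0, 0, 4),
  tsent   = c(7, 0, 0, 0, 0, 2, 2, 0, 3, 0,  3, 0, 0, 2, 0, 6, 0, 0, 0),
  treturn = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0, 2, 0, 0, 0, 4),
  csales  = c(0, 1, 1, 1, 0, 0, 0, 0, 1, 2,  0, 1, 1, 1, 0, 0, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Formula interface: the columns to sum go on the left, the grouping
# variable on the right. One row per yname comes back.
subtotals <- aggregate(cbind(bsent, breturn, tsent, treturn, csales) ~ yname,
                       data = foo, FUN = sum)
print(subtotals)

# With missing values: na.rm = TRUE is forwarded to sum(), and
# na.action = na.pass keeps rows that contain an NA so that their
# other, non-missing values are still counted.
foo_na <- foo
foo_na$bsent[1] <- NA
subtotals_na <- aggregate(cbind(bsent, breturn, tsent, treturn, csales) ~ yname,
                          data = foo_na, FUN = sum,
                          na.rm = TRUE, na.action = na.pass)
print(subtotals_na)
```

On the full 900,000-row data frame the same call should work unchanged, as long as the column names match those in the question.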
Google wasn't much help when I tried to find an answer to a similar question, so I thought I would share my solution: the janitor package combined with split() and purrr::map_df().
My use case was a script that grabs CC expenses from many different people so that one person can review them.
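The split-then-add-a-total-row idea can be sketched in base R alone, so it runs without janitor or purrr installed. This is a stand-in, not the answer's code: the expenses data frame, its column names, and the add_total_row() helper are all invented for illustration. With the packages available, janitor::adorn_totals() would presumably take the place of the helper and purrr::map_df() the place of the lapply()/rbind() pair.

```r
# Hypothetical expense data; the column names are made up for this sketch.
expenses <- data.frame(
  person = c("ann", "ann", "bob", "bob", "bob"),
  item   = c("taxi", "hotel", "lunch", "train", "hotel"),
  amount = c(20, 150, 12, 48, 130),
  stringsAsFactors = FALSE
)

# Append a single "Total" row to the rows of one group.
add_total_row <- function(group_rows) {
  total <- data.frame(
    person = group_rows$person[1],
    item   = "Total",
    amount = sum(group_rows$amount),
    stringsAsFactors = FALSE
  )
  rbind(group_rows, total)
}

# split() cuts the data frame into one piece per person, lapply() adds the
# subtotal row to each piece, and do.call(rbind, ...) stacks them again.
pieces <- lapply(split(expenses, expenses$person), add_total_row)
with_subtotals <- do.call(rbind, pieces)
rownames(with_subtotals) <- NULL
print(with_subtotals)
```

Unlike the aggregate() approach, this keeps the detail rows and interleaves a subtotal after each group, which is handy when a person has to read through the result.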
Or use the plyr library, which is easily extensible to other data classes.
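A minimal sketch of what a plyr version might look like, assuming the package is installed; the call is skipped otherwise. The small foo below is a made-up stand-in for the real data, with only two of the numeric columns.

```r
# Small illustrative data frame (values invented for this sketch).
foo <- data.frame(
  yname  = c("XMAG", "XMAG", "XMAG", "YMAG", "YMAG"),
  bsent  = c(7, 0, 2, 3, 5),
  csales = c(0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

if (requireNamespace("plyr", quietly = TRUE)) {
  # ddply() splits foo by yname, applies the function to each piece and
  # binds the results back into one data frame. numcolwise(sum) turns
  # sum() into a function that sums every numeric column of a piece.
  by_yname <- plyr::ddply(foo, "yname", plyr::numcolwise(sum))
  print(by_yname)
} else {
  message("plyr is not installed; run install.packages(\"plyr\") to try this")
}
```

Note that numcolwise() picks up every numeric column, so on the real data a numeric yissue would be summed as well unless it is dropped first.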
You can also use xtabs or tapply.
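Both functions are in base R, so a short sketch needs no extra packages. The foo data frame here is an invented stand-in with two numeric columns; the sapply() wrapper is my own way of extending tapply() to several columns and is not taken from the answer.

```r
# Small illustrative data frame (values invented for this sketch).
foo <- data.frame(
  yname  = c("XMAG", "XMAG", "XMAG", "YMAG", "YMAG"),
  bsent  = c(7, 0, 2, 3, 5),
  csales = c(0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# tapply() handles one numeric vector at a time and returns a named array.
bsent_by_yname <- tapply(foo$bsent, foo$yname, sum)
print(bsent_by_yname)

# To cover several columns, apply tapply() to each column in turn;
# the result is a matrix with one row per yname.
all_by_yname <- sapply(foo[c("bsent", "csales")], tapply, foo$yname, sum)
print(all_by_yname)

# xtabs() takes a formula; a cbind() on the left-hand side sums several
# columns at once and returns a two-way table (yname by column).
tab <- xtabs(cbind(bsent, csales) ~ yname, data = foo)
print(tab)
```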
If your data is large and speed matters, I would recommend the R function rowsum, which is a lot faster. I applied the three methods suggested in the other answers (f1 = aggregate, f2 = ddply, f3 = tapply) and compared them with f4 = rowsum.
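For anyone who wants to try the comparison, here is a rough sketch on synthetic data, not the original benchmark. The data frame big, its size, and its column values are assumptions made up for the example, and only aggregate() and rowsum() are timed. Timings vary by machine, so none are quoted here.

```r
# Synthetic stand-in for the real data: 100,000 rows, 160 distinct ynames.
set.seed(1)
n <- 100000
big <- data.frame(
  yname  = sample(sprintf("MAG%03d", 1:160), n, replace = TRUE),
  bsent  = sample(0:9, n, replace = TRUE),
  csales = sample(0:3, n, replace = TRUE),
  stringsAsFactors = FALSE
)
num_cols <- c("bsent", "csales")

# aggregate(): returns a data frame with yname as an ordinary column.
t_aggregate <- system.time(
  res_aggregate <- aggregate(big[num_cols], by = list(yname = big$yname), FUN = sum)
)

# rowsum(): returns a data frame whose row names are the ynames.
t_rowsum <- system.time(
  res_rowsum <- rowsum(big[num_cols], group = big$yname)
)

# Elapsed seconds for each method.
print(rbind(aggregate = t_aggregate, rowsum = t_rowsum)[, "elapsed"])

# Both methods should produce the same totals.
same <- all(res_aggregate$bsent ==
              res_rowsum[as.character(res_aggregate$yname), "bsent"])
print(same)
```

rowsum() also accepts na.rm = TRUE if the data contain missing values.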
You can use aggregate(). Given a data frame, it splits the rows by one or more grouping variables and applies a summary function such as sum to each group.
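A small sketch of the non-formula interface to aggregate(), using a made-up data frame x with a group column and a val column; these names are assumptions for illustration, not the answer's original example.

```r
# Hypothetical data: three groups, two values each.
x <- data.frame(
  group = rep(c("a", "b", "c"), times = 2),
  val   = c(1, 2, 3, 10, 20, 30),
  stringsAsFactors = FALSE
)

# The values to summarise come first, the grouping variables go in a
# (named) list passed to `by`, and FUN is applied within each group.
totals <- aggregate(x$val, by = list(group = x$group), FUN = sum)
print(totals)
#   group  x
# 1     a 11
# 2     b 22
# 3     c 33
```

Because a bare vector was passed in, the summed column is named x in the result; passing x["val"] instead keeps the original column name.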
There is an R package called sqldf that lets you run SQL commands on R data frames.
Besides, as you already said, GROUP BY would be a good fit. You can easily store your data in a local MySQL database and connect to it from R with the RMySQL package (most other DBMSs work too, but MySQL is the easiest to set up).
As far as I can judge, plyr is a great package too. But from the way you ask and compare your problem to GROUP BY, I guess you know some SQL, so using it might be easier for you. There are convenient functions such as dbReadTable, and if your data grows bigger you can select only the subset you really need and run your analysis on that.
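A minimal sketch of the sqldf route, assuming the package is installed; it is skipped otherwise. The foo data frame is an invented stand-in with two numeric columns, and the query is simply the GROUP BY the question had in mind.

```r
# Small illustrative data frame (values invented for this sketch).
foo <- data.frame(
  yname  = c("XMAG", "XMAG", "XMAG", "YMAG", "YMAG"),
  bsent  = c(7, 0, 2, 3, 5),
  csales = c(0, 1, 0, 1, 1),
  stringsAsFactors = FALSE
)

if (requireNamespace("sqldf", quietly = TRUE)) {
  suppressPackageStartupMessages(library(sqldf))
  # sqldf() looks up `foo` among the R data frames in the calling
  # environment and runs the query against a temporary copy of it.
  by_yname <- sqldf(
    "SELECT yname,
            SUM(bsent)  AS bsent,
            SUM(csales) AS csales
       FROM foo
      GROUP BY yname
      ORDER BY yname"
  )
  print(by_yname)
} else {
  message("sqldf is not installed; run install.packages(\"sqldf\") to try this")
}
```

No database server is needed for this: by default sqldf works through an in-memory SQLite database, so the MySQL setup described above only matters if the data should live outside R.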