按 ID 聚合并找到时间的 min() 和 max()
我有一个交易数据库,如下所示:
AccountID PaymentDate PaymentAmount
8 13 2020-02-09 1.00
9 13 2020-01-25 4.20
10 14 2020-01-01 30.68
11 14 2020-02-01 30.68
PaymentDate 采用 posix 格式。对于交易数据,我不想按时间间隔进行聚合(这是有详细记录的),而是按 ID 进行聚合。
将 min() 与 Posix 时间一起使用给出第一天,使用 max() 给出最后一天。这是我每个 ID 都需要的信息。
好的,这是我尝试过的:
# 1.
summaryBy(PaymentDate ~ AccountID, data1, FUN=c(min) )
Fehler in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { : arguments must have same length
# 2.
ddply( data1, "AccountID", summarise, min(PaymentDate))
# returns 0 and warnings:
50: In output[[var]][rng] <- df[[var]] : Anzahl der zu ersetzenden Elemente ist kein Vielfaches der Ersetzungslänge
# 3.
aggregate(PaymentDate ~ AccountID, data1, min)
Fehler in model.frame.default(formula = PaymentDate ~ AccountID, data = data1) : ungültiger Typ (list) für die Variable 'PaymentDate'
显然,如果您需要时间聚合而不是按时间聚合,则聚合不适用于 posix 时间。
但获取第一个和最后一个交易日期一定是可能的吗?!
好的
,因为我还不能回答我自己的问题,所以我将其发布在这里:
有趣。谢谢你!
我通常将 as.is=T 选项与 read.csv 一起使用,然后使用 strptime 转换时间。因此,当我查看数据结构时,我得到:
$ PaymentDate : POSIXlt, format: "2020-02-04" "2020-02-04" "2020-02-04" ...
对我来说,这看起来不是一个因素。我可以在整个列上使用 min() 和 max() 并且它可以工作。显然,POSIXlt比我想象的更麻烦。来自 POSIXlt,我
data$PaymentDate=as.Date(data$PaymentDate)
查看了结构,类被正确设置为日期。
$ PaymentDate :Class 'Date' num [1:10000] 18296 18296 18296 18297 18297 ...
现在看来有效了。但是,只有ddply返回正确的格式“2020-01-25”,而aggregate和summaryBy都以“18286”格式返回。那是自 1970 年 1 月 1 日以来的日子吗?好吧,我想我可以把它转换回来。
foo=aggregate(PaymentDate ~ AccountID, data1, min)
as.Date(foo$PaymentDate,origin="1970-01-01")
然而,必须有一些解释。而且,ddply 慢得多。
哦,为什么我先用strptime?嗯,原始文件中的日期采用不同的格式,“%d-%m-%y”。直接使用 as.Date 似乎不起作用。
编辑
我的数据的dput
structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
mday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 9L, 25L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 4L, 4L, 17L, 8L, 17L, 28L, 8L, 22L, 3L),
mon = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 2L, 3L,
3L, 5L, 6L, 6L, 7L, 8L, 8L, 9L, 9L, 11L, 11L, 1L, 2L, 5L,
7L, 10L, 10L, 4L, 0L, 4L, 6L, 3L, 2L, 11L, 11L, 4L, 10L),
year = c(110L, 110L, 110L, 110L, 110L, 110L, 110L, 110L,
110L, 109L, 110L, 110L, 109L, 110L, 110L, 109L, 110L, 109L,
109L, 110L, 109L, 110L, 109L, 110L, 109L, 109L, 109L, 110L,
109L, 110L, 110L, 110L, 109L, 109L, 110L, 109L, 109L, 110L,
109L, 109L), wday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 2L, 1L,
4L, 1L, 1L, 3L, 4L, 2L, 3L, 4L, 6L, 2L, 3L, 4L, 5L, 2L, 3L,
1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 5L, 4L, 2L, 1L, 3L, 5L,
2L), yday = c(34L, 34L, 34L, 35L, 35L, 35L, 35L, 39L, 24L,
0L, 31L, 59L, 90L, 90L, 151L, 181L, 181L, 212L, 243L, 243L,
273L, 273L, 334L, 334L, 32L, 60L, 152L, 213L, 305L, 305L,
122L, 3L, 123L, 197L, 97L, 75L, 361L, 341L, 141L, 306L),
isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT")), .Names = c("AccountID",
"PaymentDate"), row.names = c(NA, 40L), class = "data.frame")
按照您的建议进行操作后的dput:
structure(list(AccountID = c(359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L,
997L, 997L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L,
1181L, 1181L, 1181L), PaymentDate = structure(c(14245, 14277,
14305, 14335, 14368, 14397, 14426, 14457, 14488, 14518, 14550,
14579, 14613, 14641, 14669, 14700, 14732, 14761, 14791, 14823,
14853, 14883, 14915, 14944, 14442, 14320, 14606, 14707, 14386,
14951, 14293, 14432, 14477, 14540, 14540, 14540, 14540, 14540,
14540, 14551), class = "Date")), .Names = c("AccountID", "PaymentDate"
), row.names = c(10L, 25L, 26L, 13L, 33L, 27L, 16L, 18L, 19L,
21L, 29L, 23L, 32L, 11L, 12L, 14L, 31L, 15L, 17L, 28L, 20L, 22L,
30L, 24L, 34L, 36L, 37L, 35L, 39L, 38L, 45L, 42L, 48L, 50L, 51L,
52L, 53L, 54L, 55L, 40L), class = "data.frame")
原始数据的dput
structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = c("04-02-2010", "04-02-2010", "04-02-2010",
"05-02-2010", "05-02-2010", "05-02-2010", "05-02-2010", "09-02-2010",
"25-01-2010", "01-01-2009", "01-02-2010", "01-03-2010", "01-04-2009",
"01-04-2010", "01-06-2010", "01-07-2009", "01-07-2010", "01-08-2009",
"01-09-2009", "01-09-2010", "01-10-2009", "01-10-2010", "01-12-2009",
"01-12-2010", "02-02-2009", "02-03-2009", "02-06-2009", "02-08-2010",
"02-11-2009", "02-11-2010", "03-05-2010", "04-01-2010", "04-05-2009",
"17-07-2009", "08-04-2010", "17-03-2009", "28-12-2009", "08-12-2010",
"22-05-2009", "03-11-2009")), .Names = c("AccountID", "PaymentDate"
), row.names = c(NA, 40L), class = "data.frame")
I got a transaction database which looks like this:
AccountID PaymentDate PaymentAmount
8 13 2020-02-09 1.00
9 13 2020-01-25 4.20
10 14 2020-01-01 30.68
11 14 2020-02-01 30.68
PaymentDate is in posix format. With transaction data I don't want to aggregate by time intervals (this is well documented), but by ID.
Using min() with Posix times gives the first day, max() the last day. This is the information I need for every ID.
Ok, here is what I tried:
# 1.
summaryBy(PaymentDate ~ AccountID, data1, FUN=c(min) )
Fehler in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { : arguments must have same length
# 2.
ddply( data1, "AccountID", summarise, min(PaymentDate))
# returns 0 and warnings:
50: In output[[var]][rng] <- df[[var]] : Anzahl der zu ersetzenden Elemente ist kein Vielfaches der Ersetzungslänge
# 3.
aggregate(PaymentDate ~ AccountID, data1, min)
Fehler in model.frame.default(formula = PaymentDate ~ AccountID, data = data1) : ungültiger Typ (list) für die Variable 'PaymentDate'
Apparently, aggregate does not work with posix time if you need aggregates of time rather than aggregates by time.
But getting first and last transaction date must be possible?!
Reply
Ok, since I cannot answer my own question yet, I post it here:
Interesting. Thank you!
I typically use the as.is=T option with read.csv, then convert time using strptime. So when I look at the structure of my data I get:
$ PaymentDate : POSIXlt, format: "2020-02-04" "2020-02-04" "2020-02-04" ...
To me, that does not look like a factor. I can use min() and max() on the whole column and it works. Apparently, POSIXlt is more troublesome than I thought. Coming from the POSIXlt, I did
data$PaymentDate=as.Date(data$PaymentDate)
Looking at the structure, the Class is correctly set as Date.
$ PaymentDate :Class 'Date' num [1:10000] 18296 18296 18296 18297 18297 ...
Now it appears to work. However, only ddply return the right format "2020-01-25", while both aggregate and summaryBy return in the "18286" format. Was that days since 1970-01-01? Well, I guess I can convert it back.
foo=aggregate(PaymentDate ~ AccountID, data1, min)
as.Date(foo$PaymentDate,origin="1970-01-01")
However, there has to be some explanation. Also, ddply is much slower.
Oh, why do I use strptime first? Well, the date in the original file is in a different format, "%d-%m-%y". Using as.Date directly on this did not appear to work.
EDIT
dput of my data
structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L),
mday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 9L, 25L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 4L, 4L, 17L, 8L, 17L, 28L, 8L, 22L, 3L),
mon = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 2L, 3L,
3L, 5L, 6L, 6L, 7L, 8L, 8L, 9L, 9L, 11L, 11L, 1L, 2L, 5L,
7L, 10L, 10L, 4L, 0L, 4L, 6L, 3L, 2L, 11L, 11L, 4L, 10L),
year = c(110L, 110L, 110L, 110L, 110L, 110L, 110L, 110L,
110L, 109L, 110L, 110L, 109L, 110L, 110L, 109L, 110L, 109L,
109L, 110L, 109L, 110L, 109L, 110L, 109L, 109L, 109L, 110L,
109L, 110L, 110L, 110L, 109L, 109L, 110L, 109L, 109L, 110L,
109L, 109L), wday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 2L, 1L,
4L, 1L, 1L, 3L, 4L, 2L, 3L, 4L, 6L, 2L, 3L, 4L, 5L, 2L, 3L,
1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 5L, 4L, 2L, 1L, 3L, 5L,
2L), yday = c(34L, 34L, 34L, 35L, 35L, 35L, 35L, 39L, 24L,
0L, 31L, 59L, 90L, 90L, 151L, 181L, 181L, 212L, 243L, 243L,
273L, 273L, 334L, 334L, 32L, 60L, 152L, 213L, 305L, 305L,
122L, 3L, 123L, 197L, 97L, 75L, 361L, 341L, 141L, 306L),
isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec",
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT")), .Names = c("AccountID",
"PaymentDate"), row.names = c(NA, 40L), class = "data.frame")
dput after doing as you suggested:
structure(list(AccountID = c(359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L,
997L, 997L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L,
1181L, 1181L, 1181L), PaymentDate = structure(c(14245, 14277,
14305, 14335, 14368, 14397, 14426, 14457, 14488, 14518, 14550,
14579, 14613, 14641, 14669, 14700, 14732, 14761, 14791, 14823,
14853, 14883, 14915, 14944, 14442, 14320, 14606, 14707, 14386,
14951, 14293, 14432, 14477, 14540, 14540, 14540, 14540, 14540,
14540, 14551), class = "Date")), .Names = c("AccountID", "PaymentDate"
), row.names = c(10L, 25L, 26L, 13L, 33L, 27L, 16L, 18L, 19L,
21L, 29L, 23L, 32L, 11L, 12L, 14L, 31L, 15L, 17L, 28L, 20L, 22L,
30L, 24L, 34L, 36L, 37L, 35L, 39L, 38L, 45L, 42L, 48L, 50L, 51L,
52L, 53L, 54L, 55L, 40L), class = "data.frame")
dput of raw data
structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L,
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L,
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = c("04-02-2010", "04-02-2010", "04-02-2010",
"05-02-2010", "05-02-2010", "05-02-2010", "05-02-2010", "09-02-2010",
"25-01-2010", "01-01-2009", "01-02-2010", "01-03-2010", "01-04-2009",
"01-04-2010", "01-06-2010", "01-07-2009", "01-07-2010", "01-08-2009",
"01-09-2009", "01-09-2010", "01-10-2009", "01-10-2010", "01-12-2009",
"01-12-2010", "02-02-2009", "02-03-2009", "02-06-2009", "02-08-2010",
"02-11-2009", "02-11-2010", "03-05-2010", "04-01-2010", "04-05-2009",
"17-07-2009", "08-04-2010", "17-03-2009", "28-12-2009", "08-12-2010",
"22-05-2009", "03-11-2009")), .Names = c("AccountID", "PaymentDate"
), row.names = c(NA, 40L), class = "data.frame")
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
问题在于您的数据,特别是 PaymentDate 列,是一个因素。如果您首先转换 PaymentDate 列,则
ddply
和aggregate
解决方案都将按照编写的方式工作:将变量类更改为 Date。
然后运行你的原始代码。使用 ddply:
使用aggregate:
还有另一种更通用的方法可以避免此问题。默认情况下,当您使用
read.table
(或其变体,如read.csv
)创建 data.frame 时,会设置参数stringsAsFactors
为TRUE
。当我使用stringsAsFactors=FALSE
重新创建您的数据时,您不需要转换 PaymentDate 的中间步骤,并且您的代码按编写的方式工作:The problem is that your data, and in particular the PaymentDate column, is a factor. Both your
ddply
andaggregate
solutions will work as written if you first transform the PaymentDate column:Change variable class to Date.
Then run your original code. Using ddply:
Using aggregate:
There is another, more general way to avoid this problem. By default, when you create a data.frame by using
read.table
(or its variants likeread.csv
) the parameterstringsAsFactors
is set toTRUE
. When I recreate your data usingstringsAsFactors=FALSE
, you don't need the intermediate step of transforming PaymentDate, and your code works as written: