按 ID 聚合并找到时间的 min() 和 max()

发布于 2024-11-16 17:30:39 字数 6269 浏览 6 评论 0原文

我有一个交易数据库，如下所示：

   AccountID PaymentDate PaymentAmount 
8         13  2020-02-09          1.00 
9         13  2020-01-25          4.20   
10        14  2020-01-01         30.68 
11        14  2020-02-01         30.68

PaymentDate 采用 posix 格式。对于交易数据，我不想按时间间隔进行聚合（这是有详细记录的），而是按 ID 进行聚合。

将 min() 与 Posix 时间一起使用给出第一天，使用 max() 给出最后一天。这是我每个 ID 都需要的信息。

好的，这是我尝试过的：

# 1.
summaryBy(PaymentDate ~ AccountID, data1, FUN=c(min) )
Fehler in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { : arguments must have same length

# 2.
ddply( data1, "AccountID", summarise, min(PaymentDate))
# returns 0 and warnings:
50: In output[[var]][rng] <- df[[var]] : Anzahl der zu ersetzenden Elemente ist kein Vielfaches der Ersetzungslänge

# 3.
aggregate(PaymentDate ~ AccountID, data1, min)
Fehler in model.frame.default(formula = PaymentDate ~ AccountID, data = data1) : ungültiger Typ (list) für die Variable 'PaymentDate'

显然，如果您需要时间聚合而不是按时间聚合，则聚合不适用于 posix 时间。

但获取第一个和最后一个交易日期一定是可能的吗？！

好的

，因为我还不能回答我自己的问题，所以我将其发布在这里：

有趣。谢谢你！

我通常将 as.is=T 选项与 read.csv 一起使用，然后使用 strptime 转换时间。因此，当我查看数据结构时，我得到：

$ PaymentDate    : POSIXlt, format: "2020-02-04" "2020-02-04" "2020-02-04" ...

对我来说，这看起来不是一个因素。我可以在整个列上使用 min() 和 max() 并且它可以工作。显然，POSIXlt比我想象的更麻烦。来自 POSIXlt，我

data$PaymentDate=as.Date(data$PaymentDate)

查看了结构，类被正确设置为日期。

$ PaymentDate    :Class 'Date'  num [1:10000] 18296 18296 18296 18297 18297 ...

现在看来有效了。但是，只有ddply返回正确的格式“2020-01-25”，而aggregate和summaryBy都以“18286”格式返回。那是自 1970 年 1 月 1 日以来的日子吗？好吧，我想我可以把它转换回来。

foo=aggregate(PaymentDate ~ AccountID, data1, min)
as.Date(foo$PaymentDate,origin="1970-01-01")

然而，必须有一些解释。而且，ddply 慢得多。

哦，为什么我先用strptime？嗯，原始文件中的日期采用不同的格式，“%d-%m-%y”。直接使用 as.Date 似乎不起作用。

编辑

我的数据的dput

structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L, 
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    mday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 9L, 25L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    2L, 2L, 2L, 3L, 4L, 4L, 17L, 8L, 17L, 28L, 8L, 22L, 3L), 
    mon = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 2L, 3L, 
    3L, 5L, 6L, 6L, 7L, 8L, 8L, 9L, 9L, 11L, 11L, 1L, 2L, 5L, 
    7L, 10L, 10L, 4L, 0L, 4L, 6L, 3L, 2L, 11L, 11L, 4L, 10L), 
    year = c(110L, 110L, 110L, 110L, 110L, 110L, 110L, 110L, 
    110L, 109L, 110L, 110L, 109L, 110L, 110L, 109L, 110L, 109L, 
    109L, 110L, 109L, 110L, 109L, 110L, 109L, 109L, 109L, 110L, 
    109L, 110L, 110L, 110L, 109L, 109L, 110L, 109L, 109L, 110L, 
    109L, 109L), wday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 2L, 1L, 
    4L, 1L, 1L, 3L, 4L, 2L, 3L, 4L, 6L, 2L, 3L, 4L, 5L, 2L, 3L, 
    1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 5L, 4L, 2L, 1L, 3L, 5L, 
    2L), yday = c(34L, 34L, 34L, 35L, 35L, 35L, 35L, 39L, 24L, 
    0L, 31L, 59L, 90L, 90L, 151L, 181L, 181L, 212L, 243L, 243L, 
    273L, 273L, 334L, 334L, 32L, 60L, 152L, 213L, 305L, 305L, 
    122L, 3L, 123L, 197L, 97L, 75L, 361L, 341L, 141L, 306L), 
    isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT")), .Names = c("AccountID", 
"PaymentDate"), row.names = c(NA, 40L), class = "data.frame")

按照您的建议进行操作后的dput：

structure(list(AccountID = c(359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 
997L, 997L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 
1181L, 1181L, 1181L), PaymentDate = structure(c(14245, 14277, 
14305, 14335, 14368, 14397, 14426, 14457, 14488, 14518, 14550, 
14579, 14613, 14641, 14669, 14700, 14732, 14761, 14791, 14823, 
14853, 14883, 14915, 14944, 14442, 14320, 14606, 14707, 14386, 
14951, 14293, 14432, 14477, 14540, 14540, 14540, 14540, 14540, 
14540, 14551), class = "Date")), .Names = c("AccountID", "PaymentDate"
), row.names = c(10L, 25L, 26L, 13L, 33L, 27L, 16L, 18L, 19L, 
21L, 29L, 23L, 32L, 11L, 12L, 14L, 31L, 15L, 17L, 28L, 20L, 22L, 
30L, 24L, 34L, 36L, 37L, 35L, 39L, 38L, 45L, 42L, 48L, 50L, 51L, 
52L, 53L, 54L, 55L, 40L), class = "data.frame")

原始数据的dput

structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L, 
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = c("04-02-2010", "04-02-2010", "04-02-2010", 
"05-02-2010", "05-02-2010", "05-02-2010", "05-02-2010", "09-02-2010", 
"25-01-2010", "01-01-2009", "01-02-2010", "01-03-2010", "01-04-2009", 
"01-04-2010", "01-06-2010", "01-07-2009", "01-07-2010", "01-08-2009", 
"01-09-2009", "01-09-2010", "01-10-2009", "01-10-2010", "01-12-2009", 
"01-12-2010", "02-02-2009", "02-03-2009", "02-06-2009", "02-08-2010", 
"02-11-2009", "02-11-2010", "03-05-2010", "04-01-2010", "04-05-2009", 
"17-07-2009", "08-04-2010", "17-03-2009", "28-12-2009", "08-12-2010", 
"22-05-2009", "03-11-2009")), .Names = c("AccountID", "PaymentDate"
), row.names = c(NA, 40L), class = "data.frame")

原文

I got a transaction database which looks like this:

   AccountID PaymentDate PaymentAmount 
8         13  2020-02-09          1.00 
9         13  2020-01-25          4.20   
10        14  2020-01-01         30.68 
11        14  2020-02-01         30.68

PaymentDate is in posix format. With transaction data I don't want to aggregate by time intervals (this is well documented), but by ID.

Using min() with Posix times gives the first day, max() the last day. This is the information I need for every ID.

Ok, here is what I tried:

# 1.
summaryBy(PaymentDate ~ AccountID, data1, FUN=c(min) )
Fehler in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { : arguments must have same length

# 2.
ddply( data1, "AccountID", summarise, min(PaymentDate))
# returns 0 and warnings:
50: In output[[var]][rng] <- df[[var]] : Anzahl der zu ersetzenden Elemente ist kein Vielfaches der Ersetzungslänge

# 3.
aggregate(PaymentDate ~ AccountID, data1, min)
Fehler in model.frame.default(formula = PaymentDate ~ AccountID, data = data1) : ungültiger Typ (list) für die Variable 'PaymentDate'

Apparently, aggregate does not work with posix time if you need aggregates of time rather than aggregates by time.

But getting first and last transaction date must be possible?!

Reply

Ok, since I cannot answer my own question yet, I post it here:

Interesting. Thank you!

I typically use the as.is=T option with read.csv, then convert time using strptime. So when I look at the structure of my data I get:

$ PaymentDate    : POSIXlt, format: "2020-02-04" "2020-02-04" "2020-02-04" ...

To me, that does not look like a factor. I can use min() and max() on the whole column and it works. Apparently, POSIXlt is more troublesome than I thought. Coming from the POSIXlt, I did

data$PaymentDate=as.Date(data$PaymentDate)

Looking at the structure, the Class is correctly set as Date.

$ PaymentDate    :Class 'Date'  num [1:10000] 18296 18296 18296 18297 18297 ...

Now it appears to work. However, only ddply return the right format "2020-01-25", while both aggregate and summaryBy return in the "18286" format. Was that days since 1970-01-01? Well, I guess I can convert it back.

foo=aggregate(PaymentDate ~ AccountID, data1, min)
as.Date(foo$PaymentDate,origin="1970-01-01")

However, there has to be some explanation. Also, ddply is much slower.

Oh, why do I use strptime first? Well, the date in the original file is in a different format, "%d-%m-%y". Using as.Date directly on this did not appear to work.

EDIT

dput of my data

structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L, 
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = structure(list(sec = c(0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), min = c(0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L), hour = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    mday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 9L, 25L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
    2L, 2L, 2L, 3L, 4L, 4L, 17L, 8L, 17L, 28L, 8L, 22L, 3L), 
    mon = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 2L, 3L, 
    3L, 5L, 6L, 6L, 7L, 8L, 8L, 9L, 9L, 11L, 11L, 1L, 2L, 5L, 
    7L, 10L, 10L, 4L, 0L, 4L, 6L, 3L, 2L, 11L, 11L, 4L, 10L), 
    year = c(110L, 110L, 110L, 110L, 110L, 110L, 110L, 110L, 
    110L, 109L, 110L, 110L, 109L, 110L, 110L, 109L, 110L, 109L, 
    109L, 110L, 109L, 110L, 109L, 110L, 109L, 109L, 109L, 110L, 
    109L, 110L, 110L, 110L, 109L, 109L, 110L, 109L, 109L, 110L, 
    109L, 109L), wday = c(4L, 4L, 4L, 5L, 5L, 5L, 5L, 2L, 1L, 
    4L, 1L, 1L, 3L, 4L, 2L, 3L, 4L, 6L, 2L, 3L, 4L, 5L, 2L, 3L, 
    1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 5L, 4L, 2L, 1L, 3L, 5L, 
    2L), yday = c(34L, 34L, 34L, 35L, 35L, 35L, 35L, 39L, 24L, 
    0L, 31L, 59L, 90L, 90L, 151L, 181L, 181L, 212L, 243L, 243L, 
    273L, 273L, 334L, 334L, 32L, 60L, 152L, 213L, 305L, 305L, 
    122L, 3L, 123L, 197L, 97L, 75L, 361L, 341L, 141L, 306L), 
    isdst = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("sec", 
"min", "hour", "mday", "mon", "year", "wday", "yday", "isdst"
), class = c("POSIXlt", "POSIXt"), tzone = "GMT")), .Names = c("AccountID", 
"PaymentDate"), row.names = c(NA, 40L), class = "data.frame")

dput after doing as you suggested:

structure(list(AccountID = c(359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 
997L, 997L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 1181L, 
1181L, 1181L, 1181L), PaymentDate = structure(c(14245, 14277, 
14305, 14335, 14368, 14397, 14426, 14457, 14488, 14518, 14550, 
14579, 14613, 14641, 14669, 14700, 14732, 14761, 14791, 14823, 
14853, 14883, 14915, 14944, 14442, 14320, 14606, 14707, 14386, 
14951, 14293, 14432, 14477, 14540, 14540, 14540, 14540, 14540, 
14540, 14551), class = "Date")), .Names = c("AccountID", "PaymentDate"
), row.names = c(10L, 25L, 26L, 13L, 33L, 27L, 16L, 18L, 19L, 
21L, 29L, 23L, 32L, 11L, 12L, 14L, 31L, 15L, 17L, 28L, 20L, 22L, 
30L, 24L, 34L, 36L, 37L, 35L, 39L, 38L, 45L, 42L, 48L, 50L, 51L, 
52L, 53L, 54L, 55L, 40L), class = "data.frame")

dput of raw data

structure(list(AccountID = c(17L, 17L, 17L, 17L, 17L, 17L, 17L, 
17L, 17L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 359L, 
359L, 359L, 359L, 359L, 365L, 939L, 939L, 939L, 997L, 997L, 1181L
), PaymentDate = c("04-02-2010", "04-02-2010", "04-02-2010", 
"05-02-2010", "05-02-2010", "05-02-2010", "05-02-2010", "09-02-2010", 
"25-01-2010", "01-01-2009", "01-02-2010", "01-03-2010", "01-04-2009", 
"01-04-2010", "01-06-2010", "01-07-2009", "01-07-2010", "01-08-2009", 
"01-09-2009", "01-09-2010", "01-10-2009", "01-10-2010", "01-12-2009", 
"01-12-2010", "02-02-2009", "02-03-2009", "02-06-2009", "02-08-2010", 
"02-11-2009", "02-11-2010", "03-05-2010", "04-01-2010", "04-05-2009", 
"17-07-2009", "08-04-2010", "17-03-2009", "28-12-2009", "08-12-2010", 
"22-05-2009", "03-11-2009")), .Names = c("AccountID", "PaymentDate"
), row.names = c(NA, 40L), class = "data.frame")

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

情深缘浅 2024-11-23 17:30:39

问题在于您的数据，特别是 PaymentDate 列，是一个因素。如果您首先转换 PaymentDate 列，则 ddply 和 aggregate 解决方案都将按照编写的方式工作：

#Recreate data and use dput() to replicate
df <- structure(list(AccountID = c(13L, 13L, 14L, 14L), PaymentDate = c("2020-02-09", 
            "2020-01-25", "2020-01-01", "2020-02-01"), PaymentAmount = c(1, 
            4.2, 30.68, 30.68)), .Names = c("AccountID", "PaymentDate", "PaymentAmount"
    ), class = "data.frame", row.names = c("8", "9", "10", "11"))

将变量类更改为 Date。

df$PaymentDate <- as.Date(df$PaymentDate)

然后运行你的原始代码。使用 ddply：

ddply(df, .(AccountID), summarize, data=min(PaymentDate))
  AccountID       data
1        13 2020-01-25
2        14 2020-01-01

使用aggregate：

aggregate(PaymentDate ~ AccountID, df, min)
  AccountID PaymentDate
1        13  2020-01-25
2        14  2020-01-01

还有另一种更通用的方法可以避免此问题。默认情况下，当您使用 read.table（或其变体，如 read.csv）创建 data.frame 时，会设置参数 stringsAsFactors为TRUE。当我使用 stringsAsFactors=FALSE 重新创建您的数据时，您不需要转换 PaymentDate 的中间步骤，并且您的代码按编写的方式工作：

dat <- "   AccountID PaymentDate PaymentAmount 
8         13  2020-02-09          1.00 
9         13  2020-01-25          4.20   
10        14  2020-01-01         30.68 
11        14  2020-02-01         30.68 "

df <- read.table(textConnection(dat), stringsAsFactors=FALSE)
df

ddply(df, .(AccountID), summarize, data=min(PaymentDate))
  AccountID       data
1        13 2020-01-25
2        14 2020-01-01

The problem is that your data, and in particular the PaymentDate column, is a factor. Both your ddply and aggregate solutions will work as written if you first transform the PaymentDate column:

#Recreate data and use dput() to replicate
df <- structure(list(AccountID = c(13L, 13L, 14L, 14L), PaymentDate = c("2020-02-09", 
            "2020-01-25", "2020-01-01", "2020-02-01"), PaymentAmount = c(1, 
            4.2, 30.68, 30.68)), .Names = c("AccountID", "PaymentDate", "PaymentAmount"
    ), class = "data.frame", row.names = c("8", "9", "10", "11"))

Change variable class to Date.

df$PaymentDate <- as.Date(df$PaymentDate)

Then run your original code. Using ddply:

ddply(df, .(AccountID), summarize, data=min(PaymentDate))
  AccountID       data
1        13 2020-01-25
2        14 2020-01-01

Using aggregate:

aggregate(PaymentDate ~ AccountID, df, min)
  AccountID PaymentDate
1        13  2020-01-25
2        14  2020-01-01

There is another, more general way to avoid this problem. By default, when you create a data.frame by using read.table (or its variants like read.csv) the parameter stringsAsFactors is set to TRUE. When I recreate your data using stringsAsFactors=FALSE, you don't need the intermediate step of transforming PaymentDate, and your code works as written:

dat <- "   AccountID PaymentDate PaymentAmount 
8         13  2020-02-09          1.00 
9         13  2020-01-25          4.20   
10        14  2020-01-01         30.68 
11        14  2020-02-01         30.68 "

df <- read.table(textConnection(dat), stringsAsFactors=FALSE)
df

ddply(df, .(AccountID), summarize, data=min(PaymentDate))
  AccountID       data
1        13 2020-01-25
2        14 2020-01-01

回复收藏 0 原文

~没有更多了~