给定许多唯一 ID 的时间序列,我需要每个时期的前 100 个增量

发布于 2024-08-16 01:38:17 字数 561 浏览 4 评论 0原文

我在 TSV 中有一个时间序列数据,如下所示:

ID \t Date \t Value
-------------------------------
1234567 \t 2009-01-01T00:00:00.000Z \t 121
12131 \t 2009-06-01T00:00:00.000Z \t 151
12131 \t 2009-07-01T00:00:00.000Z \t 15153
...

它很容易适合 RAM,但对于 Excel 来说太大了。

每个 ID 每月有一个值,但并非所有 ID 都包含所有 12 个月的条目。

数据跨度为 12 个月,但并非所有 ID 都有全部 12 个月。我想遍历每个 ID 的数据,如果有上个月的条目,则将当月减去上个月并将其存储在新列中以获得增量。如果上个月没有条目,则返回 0。然后,对于每个月,我想要这些增量的前 100 个正值和负值以及 ID。

我想在 R 中执行此操作,因为这在 Excel 中很难,而且总是崩溃。我安装了 R、Rattle 等,并且已经完成了基本示例,但是......学习曲线很陡峭。我真的很感激一些帮助:)

I have a time series of data in TSV like so:

ID \t Date \t Value
-------------------------------
1234567 \t 2009-01-01T00:00:00.000Z \t 121
12131 \t 2009-06-01T00:00:00.000Z \t 151
12131 \t 2009-07-01T00:00:00.000Z \t 15153
...

It easily fits in RAM, but is too big for Excel.

There is one value per month per ID, but not all IDs have entries for all 12 months.

The data spans 12 months, but not all IDs have all 12 months. I want to go through the data for each ID, and if there is an entry for the previous month, take the current month minus the previous month and store it in a new column to get a delta. If there is no entry for the previous month, then return 0. Then, for each month, I want the top 100 positive and negative of those deltas, along with the ID.

I'd like to do this in R, because it's hard in Excel and it keeps crashing. I have R, Rattle, etc. installed and I've worked through basic examples, but ... the learning curve is steep. I would really appreciate some help :)

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

无远思近则忧 2024-08-23 01:38:17

首先添加所有缺失的月份:

all_combs <- expand.grid(
  ID = unique(data$ID),
  Date = unique(data$Date))

data <- merge(data, all_combs, by = c("ID", "Date"), all = T)
# Ensure data ordered by date
data <- data[with(data, order(ID, Date)), ]

然后添加一列增量(使用 diff 计算)

library(plyr)
data <- ddply(data, "ID", transform, delta = c(NA, diff(Value)))

最后,删除缺失的增量,按其值排序并提取每组中前 10 个和后 10 个。

changed <- subset(data, !is.na(delta))
changed <- changed[with(changed, order(ID, delta)), ]

# Select top 100 for each
top10 <- ddply(changed, "ID", function(df) {
 rbind(head(df, 10), tail(df, 10))
})

Start by adding in all missing months:

all_combs <- expand.grid(
  ID = unique(data$ID),
  Date = unique(data$Date))

data <- merge(data, all_combs, by = c("ID", "Date"), all = T)
# Ensure data ordered by date
data <- data[with(data, order(ID, Date)), ]

Then add a column of deltas (calculated with diff)

library(plyr)
data <- ddply(data, "ID", transform, delta = c(NA, diff(Value)))

Finally, remove missing deltas, order by their value and extract the top and bottom 10 within each group.

changed <- subset(data, !is.na(delta))
changed <- changed[with(changed, order(ID, delta)), ]

# Select top 100 for each
top10 <- ddply(changed, "ID", function(df) {
 rbind(head(df, 10), tail(df, 10))
})
独﹏钓一江月 2024-08-23 01:38:17

好的,首先是一些代码来生成一些测试数据。这会生成 100 个随机 ID,并且每个 ID 都会选择 2 年期间中的 20 个月以及随机值。然后顺序会被打乱,以获得额外的乐趣。

## Generate some IDs
ids <- sample(1000, 100)

## Generate the data
data <- do.call(rbind,
                lapply(ids,
                       function(id)
                       data.frame(ID = id,
                                  Date = sample(as.Date(paste(rep(c(2008:2009), each=12),
                                    1:12, 1, sep="-")),
                                    20),
                                  Value = sample(1000, 20))))

## Shuffle
data <- data[sample(nrow(data), nrow(data)),]

对我来说,它是这样的:

> head(data)
      ID       Date Value
1007 205 2008-07-01   235
1391 840 2008-12-01   509
918  278 2009-12-01   951
1213 945 2009-03-01   842
1369 766 2009-07-01   555
798  662 2008-12-01   531

好的,现在让我们迭代 ID 并查找每个 ID 每月的差异。在此之前,让我们将月份转换为数字,以便更容易获取差异(这有点不干净,有人知道对 Date 对象进行算术的更好方法吗?)。这只是年 * 12 + 月,以便正常算术工作:

data$Month <- as.POSIXlt(data$Date)$mon + as.POSIXlt(data$Date)$year * 12

现在计算差异:

by.id <- by(data, data$ID, function(x) {
  ## Sort by month.
  x <- x[order(x$Month),]
  ## Compute the month and value differences, taking care to pad the edge case.
  data.frame(ID=x$ID,
             Date = x$Date,             
             Month.diff=c(0, diff(x$Month)),
             Value.diff=c(0,diff(x$Value)))
})
by.id <- do.call(rbind, by.id)

结果如下:

> head(by.id)
    ID       Date Month.diff Value.diff
4.1  4 2008-02-01          0          0
4.2  4 2008-03-01          1        123
4.3  4 2008-05-01          2        -94
4.4  4 2008-06-01          1       -243
4.5  4 2008-08-01          2       -327
4.6  4 2008-10-01          2        656

如果连续月份之间的差异大于 1,则这些月份不相邻我们应该将它们的值设置为零。

by.id$Value.diff <- ifelse(by.id$Month.diff == 1,
                           by.id$Value.diff,
                           0)

最后,我们按月进行迭代,并取顶部和底部的 N 个差异(我在这里将 N 设置为 10,而不是 100,因为我的测试数据集相当小)。

by.month <- by(by.id, by.id$Date, function(x) {
  ## Sort the data in each month
  x <- x[order(x$Value.diff),]
  ## Take the top and bottom and label them accordingly.
  cbind(rbind(head(x, 10), tail(x, 10)),
        type=rep(c("min", "max"), each=10))
})

现在我们就有了。这是一个示例结果:

> by.month[[24]]
        ID       Date Month.diff Value.diff type
130.20 130 2009-12-01          1       -951  min
415.20 415 2009-12-01          1       -895  min
662.20 662 2009-12-01          1       -878  min
107.20 107 2009-12-01          1       -744  min
824.20 824 2009-12-01          1       -731  min
170.20 170 2009-12-01          1       -719  min
502.20 502 2009-12-01          1       -714  min
247.20 247 2009-12-01          1       -697  min
789.20 789 2009-12-01          1       -667  min
132.20 132 2009-12-01          1       -653  min
64.20   64 2009-12-01          1        622  max
82.20   82 2009-12-01          1        647  max
381.20 381 2009-12-01          1        698  max
303.20 303 2009-12-01          1        700  max
131.20 131 2009-12-01          1        751  max
221.20 221 2009-12-01          1        765  max
833.20 833 2009-12-01          1        791  max
806.20 806 2009-12-01          1        806  max
780.20 780 2009-12-01          1        843  max
912.20 912 2009-12-01          1        929  max

Ok, first some code to generate some test data. This makes 100 random IDs and for each one chooses 20 months from a 2 year period along with random values. The order is then shuffled for extra fun.

## Generate some IDs
ids <- sample(1000, 100)

## Generate the data
data <- do.call(rbind,
                lapply(ids,
                       function(id)
                       data.frame(ID = id,
                                  Date = sample(as.Date(paste(rep(c(2008:2009), each=12),
                                    1:12, 1, sep="-")),
                                    20),
                                  Value = sample(1000, 20))))

## Shuffle
data <- data[sample(nrow(data), nrow(data)),]

Here's what it looks like for me:

> head(data)
      ID       Date Value
1007 205 2008-07-01   235
1391 840 2008-12-01   509
918  278 2009-12-01   951
1213 945 2009-03-01   842
1369 766 2009-07-01   555
798  662 2008-12-01   531

Ok, now let's iterate over IDs and find the diff for each month for each ID. Before that, let's convert the month to a number so it'll be easier to take differences (this is a bit unclean, does anyone know a better way to do arithmetic on Date objects?). This just does year * 12 + month so that normal arithmetic works:

data$Month <- as.POSIXlt(data$Date)$mon + as.POSIXlt(data$Date)$year * 12

Now compute the differences:

by.id <- by(data, data$ID, function(x) {
  ## Sort by month.
  x <- x[order(x$Month),]
  ## Compute the month and value differences, taking care to pad the edge case.
  data.frame(ID=x$ID,
             Date = x$Date,             
             Month.diff=c(0, diff(x$Month)),
             Value.diff=c(0,diff(x$Value)))
})
by.id <- do.call(rbind, by.id)

Here's what the result looks like:

> head(by.id)
    ID       Date Month.diff Value.diff
4.1  4 2008-02-01          0          0
4.2  4 2008-03-01          1        123
4.3  4 2008-05-01          2        -94
4.4  4 2008-06-01          1       -243
4.5  4 2008-08-01          2       -327
4.6  4 2008-10-01          2        656

If the difference between consecutive months was greater than 1, then the months were not adjacent and we should set their values to zero.

by.id$Value.diff <- ifelse(by.id$Month.diff == 1,
                           by.id$Value.diff,
                           0)

Finally, we iterate by month and take the top and bottom N differences (I'll set N to 10 here rather than 100 since my test data set is rather small).

by.month <- by(by.id, by.id$Date, function(x) {
  ## Sort the data in each month
  x <- x[order(x$Value.diff),]
  ## Take the top and bottom and label them accordingly.
  cbind(rbind(head(x, 10), tail(x, 10)),
        type=rep(c("min", "max"), each=10))
})

And there we have it. Here's an example result:

> by.month[[24]]
        ID       Date Month.diff Value.diff type
130.20 130 2009-12-01          1       -951  min
415.20 415 2009-12-01          1       -895  min
662.20 662 2009-12-01          1       -878  min
107.20 107 2009-12-01          1       -744  min
824.20 824 2009-12-01          1       -731  min
170.20 170 2009-12-01          1       -719  min
502.20 502 2009-12-01          1       -714  min
247.20 247 2009-12-01          1       -697  min
789.20 789 2009-12-01          1       -667  min
132.20 132 2009-12-01          1       -653  min
64.20   64 2009-12-01          1        622  max
82.20   82 2009-12-01          1        647  max
381.20 381 2009-12-01          1        698  max
303.20 303 2009-12-01          1        700  max
131.20 131 2009-12-01          1        751  max
221.20 221 2009-12-01          1        765  max
833.20 833 2009-12-01          1        791  max
806.20 806 2009-12-01          1        806  max
780.20 780 2009-12-01          1        843  max
912.20 912 2009-12-01          1        929  max
半夏半凉 2024-08-23 01:38:17

伪代码开始:

For Each ID
  If Previous month data Exists 
    compute Diff
  Else diff = 0
return diff

For Each Month
  Max 100 (Positive)
  Min 100 (Negative)

#Realish Code
dataset$diff <- lappply(dataset,function(ID,month,value){IF dataset[month-1] = TRUE{value-(value[month-1]})})
#This gets tricky since you need to know the month and what the previous month is in a format you can test

Pseudo Code to start with:

For Each ID
  If Previous month data Exists 
    compute Diff
  Else diff = 0
return diff

For Each Month
  Max 100 (Positive)
  Min 100 (Negative)

#Realish Code
dataset$diff <- lappply(dataset,function(ID,month,value){IF dataset[month-1] = TRUE{value-(value[month-1]})})
#This gets tricky since you need to know the month and what the previous month is in a format you can test
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文