R 具有两个标识符的事件的重叠周期

发布于 2025-01-19 11:25:55 字数 1715 浏览 4 评论 0原文

我试图找到具有两个标识符的数据框的重叠周期，这与该主题中提出的其他问题不同，其中重叠观察必须仅针对一个标识符进行计数。

在我的数据框中，每笔交易都有一个唯一的 ID，并且进行交易的每个公司都有一个唯一的公司标识符。我试图找到每个公司标识符的重叠周期，按交易 ID 进行计数。

我使用 @Waldi 的代码创建了这个可重现的示例，在本主题中找到（我只在示例数据框中添加了公司标识符或“FirmID”）：

如何查找R 中非常大的数据集中具有重叠间隔日期的 ID 的比例

创建类似数据集的代码：

library (data.table)

size = 1e5

df <- data.frame(
  ID = sample(1:round(size / 5, 0)),
  FirmID = sample(1:20000),
  period = sample(c(5,10,30,45), size, replace = TRUE),
  start = sample(seq(
    as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
  ), size, replace = TRUE)
) %>% mutate(end = start + period)

查找重叠周期的代码：

dt <- data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]

setkey(dt,FirmID,start,end)
foverlaps(dt,dt,by.x=c("FirmID","start","end"),by.y=c("FirmID","start","end"))[
  ,.(noverlap=.N),by=.(FirmID,row)][
    ,.(overlap = max(noverlap>1)),by=FirmID][
      ,.(n=.N),by=.(overlap)][
        ,pct:=n/sum(n)][]

此代码的唯一问题是它显示重叠公司的数量，而不是公司标识符内重叠交易的数量。

overlap     n     pct
1:       0  5333 0.26665
2:       1 14667 0.73335

如何针对公司标识符内重叠的交易 ID 更改此代码？我自己做的一项改变并没有产生令人满意的结果：

setkey(dt,FirmID,start,end)
foverlaps(dt,dt,by.x=c("FirmID","start","end"),by.y=c("FirmID","start","end"))[
  ,.(noverlap=.N),by=.(ID,row)][
    ,.(overlap = max(noverlap>1)),by=ID][
      ,.(n=.N),by=.(overlap)][
        ,pct:=n/sum(n)][]

   overlap     n     pct
1:       0  5333 0.26665
2:       1 14667 0.73335

原文

I am trying to find overlapping periods for a data frame with two identifiers, which differs from the other questions which have been asked on this topic where overlapping observations had to be counted for only one identifier.

In my data frame, every transaction has a unique ID and every firm which is transacted in has a unique firm identifier. I am trying to find overlapping periods per firm identifier, counted per transaction ID.

I used the code from @Waldi to create this reproducible example, found in this topic (I only added a firm identifier or "FirmID" to the sample data frame):

How to find the proportion of IDs with overlapping interval dates in very large datasets in R

Code to create a similar dataset:

library (data.table)

size = 1e5

df <- data.frame(
  ID = sample(1:round(size / 5, 0)),
  FirmID = sample(1:20000),
  period = sample(c(5,10,30,45), size, replace = TRUE),
  start = sample(seq(
    as.Date('1999/01/01'), as.Date('2000/01/01'), by = "day"
  ), size, replace = TRUE)
) %>% mutate(end = start + period)

The code to find overlapping periods:

dt <- data.table(df, key = c("start", "end"))[, `:=`(row = 1:nrow(df))]

setkey(dt,FirmID,start,end)
foverlaps(dt,dt,by.x=c("FirmID","start","end"),by.y=c("FirmID","start","end"))[
  ,.(noverlap=.N),by=.(FirmID,row)][
    ,.(overlap = max(noverlap>1)),by=FirmID][
      ,.(n=.N),by=.(overlap)][
        ,pct:=n/sum(n)][]

The only problem with this code, is that it displays the number of overlapping firms, instead of the number of overlapping transactions within a firm identifier.

overlap     n     pct
1:       0  5333 0.26665
2:       1 14667 0.73335

How can this code be altered for overlapping transaction IDs within a firm identifier? One change I made myself did not yield satisfactory results:

setkey(dt,FirmID,start,end)
foverlaps(dt,dt,by.x=c("FirmID","start","end"),by.y=c("FirmID","start","end"))[
  ,.(noverlap=.N),by=.(ID,row)][
    ,.(overlap = max(noverlap>1)),by=ID][
      ,.(n=.N),by=.(overlap)][
        ,pct:=n/sum(n)][]

   overlap     n     pct
1:       0  5333 0.26665
2:       1 14667 0.73335

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

比忠 2025-01-26 11:25:55

我很难理解另一种情况的解决方案，
所以我不知道如何调整该版本，
但我相信在这种情况下您可以通过内部非等连接实现您想要的结果：

dt <- data.table(df, key = c("FirmID", "ID", "start", "end"))
dt[, firm_total := .N, by = "FirmID"][
    dt, .(FirmID, firm_total), on = .(FirmID, ID < ID, start <= end, end >= start), nomatch = NULL, mult = "all"][
        , .(n = .N, pct = .N / (firm_total[1L] * (firm_total[1L] - 1L) / 2L)), by = "FirmID"]

我们避免使用 ID ID <; 的连接中出现冗余行。身份证号（请注意，此处应解释为左侧列的 ID < 右侧列的 ID）。可能会发生冗余，因为，如果 ID x 与 ID y 重叠， y 与 x 重叠。

如果你考虑所有的 ID 对并将它们放入一个矩阵中，
最大重叠数是下三角形中元素的数量，
可以用n * (n - 1) / 2计算，
这就是我们最初添加 firm_total 的原因。

我没有进行广泛的测试，但这个版本可能更适合这种情况。
foverlaps 的文档指出，它主要针对一个表比另一个表小得多的连接，
并解释了为什么它可能是一项昂贵的操作。
您正在进行自加入，
所以两张桌子的大小相同。

以及 table.express 版本的解决方案，因为为什么不呢：

library(table.express)

dt %>%
    group_by(FirmID) %>%
    mutate(firm_total = .N) %>%
    inner_join(dt, FirmID, ID < ID, start <= end, end >= start, .expr = TRUE) %>%
    select(FirmID, firm_total) %>%
    group_by(FirmID) %>%
    summarize(n = .N, pct = .N / (firm_total[1L] * (firm_total[1L] - 1L) / 2))

编辑：如果您想要重叠列，您可以使用下三角的大小 - n 来计算。

I had a hard time wrapping my head around the solution for the other scenario,
so I'm not sure how to adjust that version,
but I believe you can achieve the result you want in this case with an inner non-equi join:

dt <- data.table(df, key = c("FirmID", "ID", "start", "end"))
dt[, firm_total := .N, by = "FirmID"][
    dt, .(FirmID, firm_total), on = .(FirmID, ID < ID, start <= end, end >= start), nomatch = NULL, mult = "all"][
        , .(n = .N, pct = .N / (firm_total[1L] * (firm_total[1L] - 1L) / 2L)), by = "FirmID"]

We avoid redundant rows in the join with ID < ID
(and note that here that should be interpreted as left-hand side column's ID < right-hand side column's ID).
Redundancy can happen because,
if ID x overlaps with ID y,
y overlaps with x.

If you think of all pairs of IDs and put them in a matrix,
the maximum number of overlaps would be the number of elements in the lower triangular,
which can be calculated with n * (n - 1) / 2,
that's why we initially add firm_total.

I didn't do extensive testing but this version may be better for this scenario.
The documentation of foverlaps states that it's mainly targeted at joins where one table is much smaller than the other one,
and explains why it may be an expensive operation.
You're doing a self-join,
so both tables are the same size.

And a table.express version of the solution because why not:

library(table.express)

dt %>%
    group_by(FirmID) %>%
    mutate(firm_total = .N) %>%
    inner_join(dt, FirmID, ID < ID, start <= end, end >= start, .expr = TRUE) %>%
    select(FirmID, firm_total) %>%
    group_by(FirmID) %>%
    summarize(n = .N, pct = .N / (firm_total[1L] * (firm_total[1L] - 1L) / 2))

EDIT: and if you want your overlap column you could compute that with size of lower triangular - n.

回复收藏 0 原文

~没有更多了~