Matching rows to other rows with the smallest difference on a column
I want to perform matching between two groups in a data frame, where all rows belonging to one group (binary) are matched with observations from the other group (with replacement) if their difference on another column is smaller than a pre-set threshold. Let's use the toy-dataset below:
set.seed(123)
df <- data.frame(id = c(1:10),
                 group = rbinom(10, 1, 0.3),
                 value = round(runif(10), 2))
threshold <- round(sd(df$value), 2)
Which looks like this
> df
id group value
1 1 0 0.96
2 2 1 0.45
3 3 0 0.68
4 4 1 0.57
5 5 1 0.10
6 6 0 0.90
7 7 0 0.25
8 8 1 0.04
9 9 0 0.33
10 10 0 0.95
> threshold
[1] 0.35
In this case, I want to match rows with group==1 with rows with group==0, where the difference in value is smaller than threshold (0.35). This should lead to a data frame looking like this (apologies for potential errors, I did it manually).
id matched_id
1 2 3
2 2 7
3 2 9
4 4 3
5 4 6
6 4 7
7 4 9
8 5 7
9 5 9
10 8 7
11 8 9
Thank you!
Answers (2)
You can use df |> left_join(df, by = character()), which is the tidyverse way of performing a cartesian product. Then filter according to threshold.
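A runnable sketch of that suggestion, assuming the dplyr package; the "_matched" suffix and the final column names below are my own choices, not part of the original answer:

library(dplyr)

# Cross join df with itself: every row paired with every row.
# (dplyr >= 1.1.0 prefers cross_join(df, df) over by = character().)
matches <- df |>
  left_join(df, by = character(), suffix = c("", "_matched")) |>
  # keep group-1 rows on the left, group-0 rows on the right,
  # and only pairs whose values differ by less than the threshold
  filter(group == 1,
         group_matched == 0,
         abs(value - value_matched) < threshold) |>
  select(id, matched_id = id_matched)

matches

With the toy data above, this should reproduce the id / matched_id pairs listed in the question.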
UPDATED ANSWER: Was going slow on a larger dataset, so I tried to make the code a bit more efficient.
Came up with a solution that seems to do what I want. Not sure how efficient this code is on larger data but seems to work. It leads to the id / matched_id data frame shown in the question.
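The code from this self-answer did not survive in this copy of the post. Purely as an illustration (not necessarily the poster's actual solution), one compact base-R way to get the same id / matched_id pairs without a joined data frame is:

# Split the rows by group
g1 <- df[df$group == 1, ]
g0 <- df[df$group == 0, ]

# Pairwise absolute differences: rows index group-1 rows, columns index group-0 rows
diffs <- abs(outer(g1$value, g0$value, "-"))

# All pairs whose difference is below the threshold
hits <- which(diffs < threshold, arr.ind = TRUE)

matched <- data.frame(id = g1$id[hits[, "row"]],
                      matched_id = g0$id[hits[, "col"]])
matched[order(matched$id, matched$matched_id), ]

Keeping the comparison in one vectorised matrix rather than a row-by-row join is usually noticeably faster on larger data.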