1:1 matching with multiple matches between treatment and control groups

Posted 2025-01-23 04:47:17


Hi, I'm currently using a large observational dataset to estimate the average effect of a treatment. To balance the treatment and control groups, I matched individuals on a series of variables using the full_join command.

matched_sample <- full_join(case, control, by = matched_variables)

The matched sample ended up with many rows because some individuals were matched more than once. I documented the number of matches found for each individual. Here I present a simpler version:

case_id <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "E", "F", "F")
num_controls_matched <- c(7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 5, 5, 5, 5, 5, 2, 2, 1, 2, 2)
control_id <- c("a", "b", "c", "d", "e", "f", "g", "a", "b", "e", "a", "b", "e", "f", "h", "a", "e", "a", "b", "e")
num_cases_matched <- c(5, 4, 1, 1, 5, 2, 1, 5, 4, 5, 5, 4, 5, 2, 1, 5, 5, 5, 4, 5)

df <- data.frame(case_id, num_controls_matched, control_id, num_cases_matched)
df

   case_id num_controls_matched control_id num_cases_matched
1        A                    7          a                 5
2        A                    7          b                 4
3        A                    7          c                 1
4        A                    7          d                 1
5        A                    7          e                 5
6        A                    7          f                 2
7        A                    7          g                 1
8        B                    3          a                 5
9        B                    3          b                 4
10       B                    3          e                 5
11       C                    5          a                 5
12       C                    5          b                 4
13       C                    5          e                 5
14       C                    5          f                 2
15       C                    5          h                 1
16       D                    2          a                 5
17       D                    2          e                 5
18       E                    1          a                 5
19       F                    2          b                 4
20       F                    2          e                 5

where case_id and control_id are IDs of those from the treatment and the control groups, num_controls_matched is the number of matches found for the treated individuals, and num_cases_matched is the number of matches found for individuals in the control group.
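
For reference, both count columns can be derived from the pair list itself. The following is a minimal base R sketch, assuming each case-control pair appears on exactly one row; the toy `pairs` data frame is hypothetical and only stands in for the joined sample.

```r
# Hypothetical toy pair list; in practice this would be the joined sample.
pairs <- data.frame(
  case_id    = c("A", "A", "B"),
  control_id = c("a", "b", "a")
)

# Number of controls matched to each case, repeated on every row of that case.
pairs$num_controls_matched <- ave(seq_len(nrow(pairs)), pairs$case_id, FUN = length)

# Number of cases matched to each control, repeated on every row of that control.
pairs$num_cases_matched <- ave(seq_len(nrow(pairs)), pairs$control_id, FUN = length)
```

Here `ave()` applies `length` within each group and returns a vector aligned with the original rows, so no separate join back onto the data is needed.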

I would like to keep as many treated individuals in the sample as possible. I would also like to prioritise matches for the "less popular" individuals. For example, treated individual E was matched to only 1 control, so the match E-a should be prioritised. Next, D and F each have 2 matches. Because b has only 4 matches whilst a and e both have 5, F-b should be prioritised. D can therefore only be matched with e, since a is already taken by E. The next one should be B because it has 3 matches. However, since a, b and e have already been matched with E, F and D respectively, B has no match (NA). C is matched with h because h has only 1 match. A can be matched with c, d, or g.

I would like to construct a data frame that indicates the final 1:1 matches:

          case_id control_id
                A          g
                B         NA
                C          h
                D          e
                E          a
                F          b

The original dataset includes more than 2,000 individuals, and some individuals have more than 30 matches. Due to the characteristics of some of the matching variables, propensity score matching is not what I am looking for. I would be really grateful for your help with this.


明媚如初 2025-01-30 04:47:17

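library(dplyr)  # provides %>% and filter()

# Recursive helper: i is the match count currently being prioritised.
fun <- function(df, i = 1){
  # Pairs in which the case or the control has exactly i candidate matches
  a <- df %>%
    filter(num_controls_matched == i | num_cases_matched == i)
  # Remaining pairs that share neither a case nor a control with `a`
  b <- df %>%
    filter(!(case_id %in% a$case_id | control_id %in% a$control_id))
  # If a leftover case still has several candidate controls, retry with i + 1
  if (any(table(b$case_id) > 1)) fun(df, i + 1)
  else rbind(a, b)[c('case_id', 'control_id')]
}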

fun(df)
  case_id control_id
1       A          a
2       B          b
3       C          c
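
The priority rule described in the question can also be sketched as a single greedy pass in base R. This is only an illustration under stated assumptions, not a definitive implementation: the helper name `greedy_one_to_one` is hypothetical, the counts are treated as fixed (they are not recomputed as controls get used up), and a greedy pass is a heuristic that is not guaranteed to keep the maximum possible number of treated individuals in every dataset.

```r
# Example data from the question.
df <- data.frame(
  case_id = c("A","A","A","A","A","A","A","B","B","B",
              "C","C","C","C","C","D","D","E","F","F"),
  num_controls_matched = c(7,7,7,7,7,7,7,3,3,3,5,5,5,5,5,2,2,1,2,2),
  control_id = c("a","b","c","d","e","f","g","a","b","e",
                 "a","b","e","f","h","a","e","a","b","e"),
  num_cases_matched = c(5,4,1,1,5,2,1,5,4,5,5,4,5,2,1,5,5,5,4,5)
)

# Hypothetical helper: greedy 1:1 matching that favours the least popular
# cases first and, within a case, the least popular controls.
greedy_one_to_one <- function(pairs) {
  pairs$case_id    <- as.character(pairs$case_id)
  pairs$control_id <- as.character(pairs$control_id)

  # Sort by the case's number of candidates, then by the control's popularity.
  # Ties keep their original row order.
  pairs <- pairs[order(pairs$num_controls_matched, pairs$num_cases_matched), ]

  used_case    <- character(0)
  used_control <- character(0)
  keep <- logical(nrow(pairs))

  for (i in seq_len(nrow(pairs))) {
    cs <- pairs$case_id[i]
    ct <- pairs$control_id[i]
    # Accept the pair only if neither side has been used yet.
    if (!(cs %in% used_case) && !(ct %in% used_control)) {
      keep[i] <- TRUE
      used_case    <- c(used_case, cs)
      used_control <- c(used_control, ct)
    }
  }

  matched   <- pairs[keep, c("case_id", "control_id")]
  all_cases <- data.frame(case_id = sort(unique(pairs$case_id)))
  # Left join so that unmatched cases appear with control_id = NA.
  merge(all_cases, matched, by = "case_id", all.x = TRUE)
}

result <- greedy_one_to_one(df)
result
```

On the example data this accepts E-a, F-b, D-e, C-h and A-c in that order and leaves B with NA. A-c is one of the three acceptable matches for A listed in the question (c, d, or g). Recomputing the counts after every accepted pair would follow the stated rule more closely at the cost of extra work per step.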

