1:1 matching with multiple matches between the treatment and control groups
Hi, I'm currently using a large observational dataset to estimate the average effect of a treatment. To balance the treatment and control groups, I matched individuals on a series of variables using dplyr's full_join command:
library(dplyr)
matched_sample <- full_join(case, control, by = matched_variables)
The matched sample ended up with many rows because some individuals were matched more than once. I documented the number of matches found for each individual. Here I present a simpler version:
case_id <- c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "E", "F", "F")
num_controls_matched <- c(7, 7, 7, 7, 7, 7, 7, 3, 3, 3, 5, 5, 5, 5, 5, 2, 2, 1, 2, 2)
control_id <- c("a", "b", "c", "d", "e", "f", "g", "a", "b", "e", "a", "b", "e", "f", "h", "a", "e", "a", "b", "e")
num_cases_matched <- c(5, 4, 1, 1, 5, 2, 1, 5, 4, 5, 5, 4, 5, 2, 1, 5, 5, 5, 4, 5)
# Combine the vectors into one data frame, one row per candidate pair
data.frame(case_id, num_controls_matched, control_id, num_cases_matched)
   case_id num_controls_matched control_id num_cases_matched
1        A                    7          a                 5
2        A                    7          b                 4
3        A                    7          c                 1
4        A                    7          d                 1
5        A                    7          e                 5
6        A                    7          f                 2
7        A                    7          g                 1
8        B                    3          a                 5
9        B                    3          b                 4
10       B                    3          e                 5
11       C                    5          a                 5
12       C                    5          b                 4
13       C                    5          e                 5
14       C                    5          f                 2
15       C                    5          h                 1
16       D                    2          a                 5
17       D                    2          e                 5
18       E                    1          a                 5
19       F                    2          b                 4
20       F                    2          e                 5
where case_id and control_id are the IDs of individuals in the treatment and control groups, num_controls_matched is the number of controls matched to each treated individual, and num_cases_matched is the number of treated individuals matched to each control.
I would like to keep as many treated individuals in the sample as possible. I would also like to prioritise the matches for the "less popular" individuals. For example, the treated individual E was matched to only 1 control, so the match E-a should be prioritised. Next, D and F both have 2 matches. Because b has only 4 matches whilst a and e both have 5, F-b should be prioritised. Since a is already taken by E, D can only be matched with e. The next one should be B because it has 3 matches. However, since a, b and e have already been matched with E, F and D respectively, B has no match (NA). C is matched with h because h has only 1 match. A can be matched with c, d, or g.
I would like to construct a data frame that indicates the final 1:1 matches:
case_id control_id
A g
B NA
C h
D e
E a
F b
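To make the prioritisation rule described above concrete, here is a minimal base R sketch of it as a greedy pass over the candidate pairs. It is only one reading of the rule, under assumptions that are mine rather than established: popularity is counted once from the full pair list and never updated, pairs are visited by fewest controls per case and then fewest cases per control, and remaining ties are broken by row order. The names candidates, greedy_match and result are hypothetical.

```r
# Toy data from the question: one row per candidate case-control pair.
candidates <- data.frame(
  case_id = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "B",
              "C", "C", "C", "C", "C", "D", "D", "E", "F", "F"),
  control_id = c("a", "b", "c", "d", "e", "f", "g", "a", "b", "e",
                 "a", "b", "e", "f", "h", "a", "e", "a", "b", "e"),
  stringsAsFactors = FALSE
)

# Popularity counts, recomputed from the pairs themselves.
ones <- rep(1L, nrow(candidates))
candidates$num_controls_matched <- ave(ones, candidates$case_id, FUN = sum)
candidates$num_cases_matched <- ave(ones, candidates$control_id, FUN = sum)

# Hypothetical helper: greedy 1:1 assignment (assumed rule, see above).
greedy_match <- function(candidates) {
  # Visit pairs of the least popular cases first, then the least popular
  # controls; the row index makes tie-breaking explicit and reproducible.
  ord <- order(candidates$num_controls_matched,
               candidates$num_cases_matched,
               seq_len(nrow(candidates)))
  cases <- sort(unique(candidates$case_id))
  chosen <- setNames(rep(NA_character_, length(cases)), cases)
  used_controls <- character(0)
  for (i in ord) {
    case <- candidates$case_id[i]
    control <- candidates$control_id[i]
    # Accept the pair only if both sides are still free.
    if (is.na(chosen[[case]]) && !(control %in% used_controls)) {
      chosen[[case]] <- control
      used_controls <- c(used_controls, control)
    }
  }
  # Cases that never found a free control keep NA.
  data.frame(case_id = names(chosen), control_id = unname(chosen),
             stringsAsFactors = FALSE)
}

result <- greedy_match(candidates)
print(result)
```

On the toy data this gives E-a, F-b, D-e, C-h, no match for B, and A-c (row order picks c among the equally acceptable c, d and g). A greedy pass like this is not guaranteed in general to keep the largest possible number of treated individuals, so it should be read as a starting point rather than a full solution.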
The original dataset includes more than 2,000 individuals, and some individuals have more than 30 matches. Because of the characteristics of some of the matching variables, propensity score matching is not what I am looking for. I would be really grateful for your help with this.