Recoding a complex composite score in R

Posted 2025-01-25 14:19:49 · 3866 words · 4 views · 0 comments

Assume my research concerns an observational longitudinal cohort study.

Let γ_comp be the composite outcome of interest and γ1....γ4 at time t1 and t2 denote components of γ_comp. In addition, the dataset has three other variables (χ1, χ2, and χ3) which will be used in future analysis but are not necessary to code γ_comp. Here is an excerpt of the data.frame

df <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 
                    Y1_t1 = c(5, 6, 10, 7, 5, 7, 5, 4, 7, 4), 
                    Y2_t1 = c(6, 4, 8, 8, 7, 10, 7, 6, 5, 7), 
                    Y3_t1 = c(5, 6, 10, 4, 8, 5, 10, 5, 4, 6), 
                    Y4_t1 = c(4.5, 8.5, 9.5, 4.5, 5, 8, 4.5, 8.5, 4, 6), 
                    Y1_t2 = c(6, 4, 5, 5, 3, 4, 8, 4, 3, 2), 
                    Y2_t2 = c(5, 4, 3, 6, 5, 5, 5, 2, 2, 8), 
                    Y3_t2 = c(2, 2, 4, 5, 4, 9, 5, 3, 2, 4), 
                    Y4_t2 = c(3.5, 6, 5, 5, 4.5, 4, 2.5, 7, 4.5, 4), 
                    X1 = c(40, 45, 52, 44, 42, 65, 55, 61, 52, 49), 
                    X2 = c("NL", "UK", "NL", "US", "UK", "US", "NL", "NL", "UK", "UK"), 
                    X3 = c(2000, 2005, 2003, 2000, 2001, 2002, 2003, 2004, 2001, 2000)), 
                    class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L))

Structure

spec_tbl_df [10 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID   : num [1:10] 1 2 3 4 5 6 7 8 9 10
 $ Y1_t1: num [1:10] 5 6 10 7 5 7 5 4 7 4
 $ Y2_t1: num [1:10] 6 4 8 8 7 10 7 6 5 7
 $ Y3_t1: num [1:10] 5 6 10 4 8 5 10 5 4 6
 $ Y4_t1: num [1:10] 4.5 8.5 9.5 4.5 5 8 4.5 8.5 4 6
 $ Y1_t2: num [1:10] 6 4 5 5 3 4 8 4 3 2
 $ Y2_t2: num [1:10] 5 4 3 6 5 5 5 2 2 8
 $ Y3_t2: num [1:10] 2 2 4 5 4 9 5 3 2 4
 $ Y4_t2: num [1:10] 3.5 6 5 5 4.5 4 2.5 7 4.5 4
 $ X1   : num [1:10] 40 45 52 44 42 65 55 61 52 49
 $ X2   : chr [1:10] "NL" "UK" "NL" "US" ...
 $ X3   : num [1:10] 2000 2005 2003 2000 2001 ...

As mentioned earlier, I am interested in calculating γ_comp. The rules for recoding are as follows:

  • 3 out of 4 components (i.e., γ1....γ4) must have more than 20% improvement (i.e., decrease) on a numeric scale (0 - 10) [higher is worse] at t2 compared to t1.
  • In the "remaining component," there should be no worsening of more than 20% at t2 compared to t1.

I believe the following steps have to be taken to achieve this aim. First, Y1_diff = Y1_t2/Y1_t1 must be calculated for every component. This is the ratio between the two time points and should be <0.80 for a component to count as improved. Next, an if_else condition has to be applied, which enforces these rules and returns 1 if the rules are met and 0 if not (i.e., "responded" to treatment or not).
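
As a quick illustration of the two rules, the ratios for a single participant can be checked by hand. This is only a minimal base R sketch using the values of ID 9 from the excerpt; the names `t1`, `t2`, `ratio`, `n_improved`, and `no_big_worsening` are illustrative, and it assumes that "no worsening of more than 20%" means a ratio of at most 1.2.

```r
# Values for ID 9, copied from the data excerpt (components Y1..Y4)
t1 <- c(Y1 = 7, Y2 = 5, Y3 = 4, Y4 = 4)
t2 <- c(Y1 = 3, Y2 = 2, Y3 = 2, Y4 = 4.5)

ratio <- t2 / t1                        # below 0.8 means more than 20% improvement
n_improved <- sum(ratio < 0.8)          # rule 1: at least 3 of 4 components improved
no_big_worsening <- all(ratio <= 1.2)   # rule 2 (assumed reading): no ratio above 1.2

# 1 if both rules hold, 0 otherwise
Ycomp <- as.integer(n_improved >= 3 && no_big_worsening)
round(ratio, 2)
Ycomp
```

For this participant three components improve and the fourth worsens by 12.5%, so `Ycomp` evaluates to 1, in line with the desired output for ID 9.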

For example, this could be a desired output:

      ID Ycomp Y1_t1 Y2_t1 Y3_t1 Y4_t1 Y1_t2 Y2_t2 Y3_t2 Y4_t2 Y1_diff Y2_diff Y3_diff Y4_diff    X1 X2       X3
 1     1     0     5     6     5   4.5     6     5     2   3.5    1.2     0.83    0.4     0.78    40 NL     2000
 2     2     1     6     4     6   8.5     4     4     2   6      0.67    1       0.33    0.71    45 UK     2005
 3     3     1    10     8    10   9.5     5     3     4   5      0.5     0.38    0.4     0.53    52 NL     2003
 4     4     0     7     8     4   4.5     5     6     5   5      0.71    0.75    1.25    1.11    44 US     2000
 5     5     1     5     7     8   5       3     5     4   4.5    0.6     0.71    0.5     0.9     42 UK     2001
 6     6     0     7    10     5   8       4     5     9   4      0.57    0.5     1.8     0.5     65 US     2002
 7     7     0     5     7    10   4.5     8     5     5   2.5    1.6     0.71    0.5     0.56    55 NL     2003
 8     8     0     4     6     5   8.5     4     2     3   7      1       0.33    0.6     0.82    61 NL     2004
 9     9     1     7     5     4   4       3     2     2   4.5    0.43    0.4     0.5     1.13    52 UK     2001
10    10     1     4     7     6   6       2     8     4   4      0.5     1.14    0.67    0.67    49 UK     2000

I would appreciate any advice on recoding the composite score γ_comp. Alternative methods are also welcome. The idea is to use γ_comp in logistic regression in future analysis.

Comments (2)

微凉徒眸意 2025-02-01 14:19:49

This should do it for you, assuming my understanding is correct:

library(dplyr)
library(tidyr)

inner_join(
  df, 
  df %>%
    select(ID,starts_with("Y")) %>% 
    pivot_longer(!ID,names_to = c("Y","t"), names_sep="_") %>% 
    pivot_wider(id_cols = ID:Y, names_from=t, values_from = value) %>% 
    mutate(change=1-t2/t1) %>% 
    group_by(ID) %>% 
    mutate(impct = sum(change>0.2)) %>% 
    summarize(Y_comp=1*all(impct==4 | (impct==3 & min(change)>=-0.2))) 
) %>% relocate(Y_comp,.after = ID)

Output:

      ID Y_comp Y1_t1 Y2_t1 Y3_t1 Y4_t1 Y1_t2 Y2_t2 Y3_t2 Y4_t2    X1 X2       X3
   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
 1     1      0     5     6     5   4.5     6     5     2   3.5    40 NL     2000
 2     2      1     6     4     6   8.5     4     4     2   6      45 UK     2005
 3     3      1    10     8    10   9.5     5     3     4   5      52 NL     2003
 4     4      0     7     8     4   4.5     5     6     5   5      44 US     2000
 5     5      1     5     7     8   5       3     5     4   4.5    42 UK     2001
 6     6      0     7    10     5   8       4     5     9   4      65 US     2002
 7     7      0     5     7    10   4.5     8     5     5   2.5    55 NL     2003
 8     8      0     4     6     5   8.5     4     2     3   7      61 NL     2004
 9     9      1     7     5     4   4       3     2     2   4.5    52 UK     2001
10    10      1     4     7     6   6       2     8     4   4      49 UK     2000

Explanation:

This is an inner join between df and a new data frame that contains two columns, ID and Y_comp. How is this second frame created?

  1. I select the columns ID and those starting with "Y".
  2. I pivot long, and then pivot wide, to get the data into a format with four columns (ID, Y, t1, and t2).
  3. On each row, I estimate the change as 1-t2/t1.
  4. For each ID (group_by(ID)), I generate a column impct as the number of times change exceeds 0.2. This is constant within each ID.
  5. For each ID, I define Y_comp as TRUE if all of the rows have impct==4 (i.e., all are improvements) OR if three are improvements and the minimum change in the set is not less than negative 0.2.
  6. I multiply by 1 in that same line, to convert Y_comp to numeric 1/0, rather than T/F.
  7. After the join is completed, I move Y_comp after ID, using relocate().
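
The steps above can also be cross-checked without any pivoting. The following is a rough base R sketch, not the method used in this answer: it rebuilds the question's data as a plain data.frame (an assumption made only to keep the block self-contained) and applies the same `change`, `impct`, and minimum-change logic with matrices. The names `comp`, `t1`, `t2`, and `worst` are illustrative.

```r
# Component columns from the question, rebuilt as a plain data.frame
df <- data.frame(
  ID    = 1:10,
  Y1_t1 = c(5, 6, 10, 7, 5, 7, 5, 4, 7, 4),
  Y2_t1 = c(6, 4, 8, 8, 7, 10, 7, 6, 5, 7),
  Y3_t1 = c(5, 6, 10, 4, 8, 5, 10, 5, 4, 6),
  Y4_t1 = c(4.5, 8.5, 9.5, 4.5, 5, 8, 4.5, 8.5, 4, 6),
  Y1_t2 = c(6, 4, 5, 5, 3, 4, 8, 4, 3, 2),
  Y2_t2 = c(5, 4, 3, 6, 5, 5, 5, 2, 2, 8),
  Y3_t2 = c(2, 2, 4, 5, 4, 9, 5, 3, 2, 4),
  Y4_t2 = c(3.5, 6, 5, 5, 4.5, 4, 2.5, 7, 4.5, 4)
)

comp <- paste0("Y", 1:4)
t1 <- as.matrix(df[paste0(comp, "_t1")])   # 10 x 4 matrix of baseline values
t2 <- as.matrix(df[paste0(comp, "_t2")])   # 10 x 4 matrix of follow-up values

change <- 1 - t2 / t1                # same definition as step 3
impct  <- rowSums(change > 0.2)      # step 4: number of improved components per ID
worst  <- apply(change, 1, min)      # most negative change per ID

# Step 5 and 6: all four improved, or three improved and no change below -0.2
df$Y_comp <- as.integer(impct == 4 | (impct == 3 & worst >= -0.2))
df[c("ID", "Y_comp")]
```

On the ten rows of the excerpt this gives the same 0/1 pattern as the Y_comp column in the output shown above.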

Update:

The OP is having an error, likely caused by namespace collision; one solution is to be specific about the packages being used:

library(magrittr)
dplyr::inner_join(
  df, 
  df %>%
    dplyr::select(ID,starts_with("Y")) %>% 
    tidyr::pivot_longer(!ID,names_to = c("Y","t"), names_sep="_") %>% 
    tidyr::pivot_wider(id_cols = ID:Y, names_from=t, values_from = value) %>% 
    dplyr::mutate(change=1-t2/t1) %>% 
    dplyr::group_by(ID) %>% 
    dplyr::mutate(impct = sum(change>0.2)) %>% 
    dplyr::summarize(Y_comp=1*all(impct==4 | (impct==3 & min(change)>=-0.2))) 
) %>% dplyr::relocate(Y_comp,.after = ID)

一场信仰旅途 2025-02-01 14:19:49

Inspired by the method of langtang, I found one possible solution to the problem:

library(dplyr)

df <- df %>% mutate(Y1_diff = 
                case_when( Y1_t2/ Y1_t1 < 0.8 ~ 1,
                           Y1_t2 == 0 ~ 0,
                           Y1_t2/ Y1_t1 >= 0.8 & Y1_t2/ Y1_t1 <=1.2 ~ 0, 
                           TRUE ~ -1)) %>%
  mutate(Y2_diff = 
           case_when( Y2_t2/ Y2_t1 < 0.8 ~ 1,
                      Y2_t2 == 0 ~ 0,
                      Y2_t2/ Y2_t1 >= 0.8 & Y2_t2/ Y2_t1 <=1.2 ~ 0, 
                      TRUE ~ -1)) %>%
  mutate(Y3_diff = 
           case_when( Y3_t2/ Y3_t1 < 0.8 ~ 1,
                      Y3_t2 == 0 ~ 0,
                      Y3_t2/ Y3_t1 >= 0.8 & Y3_t2/ Y3_t1 <=1.2 ~ 0, 
                      TRUE ~ -1)) %>%
  mutate(Y4_diff = 
           case_when( Y4_t2/ Y4_t1 < 0.8 ~ 1,
                      Y4_t2 == 0 ~ 0,
                      Y4_t2/ Y4_t1 >= 0.8 & Y4_t2/ Y4_t1 <=1.2 ~ 0, 
                      TRUE ~ -1)) %>%
  mutate(Ycomp = 
           case_when(Y1_diff+Y2_diff+Y3_diff+Y4_diff >=3 ~ 1,
                     TRUE ~ 0))

Explanation

I first create four variables that assess whether the ratio t2/t1 was less than 0.8 (i.e., more than 20% improved), between 0.8 and 1.2, or more than 1.2 (worsened). These variables (Yn_diff) are coded +1 for improvement, 0 if in between, and -1 if worsened. I also check whether the value at t2 is zero and give that case a score of 0 because, in my real dataset, there were scenarios where both t1 and t2 were 0, which gives NaN. Finally, I add up the four variables; a sum of at least 3 gives the correct output in the variable Ycomp.
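
The four near-identical case_when() calls could be folded into one small function. Below is a hedged base R sketch of that idea; `score_change` is a hypothetical helper name, not part of dplyr or of the code in this answer, and treating 0/0 as "no change" is the same assumption described in the explanation. Unlike the case_when() version, genuinely missing inputs stay NA rather than being coded -1.

```r
# Hypothetical helper: code a component as +1 (improved), 0 (stable), -1 (worsened)
score_change <- function(t1, t2) {
  ratio <- t2 / t1
  out <- ifelse(ratio < 0.8, 1, ifelse(ratio <= 1.2, 0, -1))
  out[is.nan(ratio)] <- 0   # 0/0 yields NaN; scored as 0 (assumption from the answer)
  out
}

# Elementwise check, including a 0/0 case in the second position
score_change(c(5, 0, 4, 7), c(8, 0, 4.5, 3))

# Toy input: rows are ID 6 and ID 9 from the excerpt, columns are Y1..Y4
t1 <- rbind(c(7, 10, 5, 8), c(7, 5, 4, 4))
t2 <- rbind(c(4, 5, 9, 4), c(3, 2, 2, 4.5))

scores <- score_change(t1, t2)              # works elementwise on matrices as well
Ycomp  <- as.integer(rowSums(scores) >= 3)  # responder if the scores sum to at least 3
Ycomp
```

For the two toy rows, ID 6 scores 1, 1, -1, 1 (sum 2) and ID 9 scores 1, 1, 1, 0 (sum 3), so only ID 9 is coded as a responder.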

Output

      ID Ycomp Y1_t1 Y1_t2 Y2_t1 Y2_t2 Y3_t1 Y3_t2 Y4_t1 Y4_t2
 1     1     0     5     6     6     5     5     2   4.5   3.5
 2     2     1     6     4     4     4     6     2   8.5   6  
 3     3     1    10     5     8     3    10     4   9.5   5  
 4     4     0     7     5     8     6     4     5   4.5   5  
 5     5     1     5     3     7     5     8     4   5     4.5
 6     6     0     7     4    10     5     5     9   8     4  
 7     7     0     5     8     7     5    10     5   4.5   2.5
 8     8     0     4     4     6     2     5     3   8.5   7  
 9     9     1     7     3     5     2     4     2   4     4.5
10    10     1     4     2     7     8     6     4   6     4 