Recoding a complex composite score in R

Posted 2025-01-25 14:19:49

Assume my research concerns an observational longitudinal cohort study.

Let γ_comp be the composite outcome of interest, and let γ1, ..., γ4, each measured at times t1 and t2, denote its components (columns Y1_t1 through Y4_t2 below). In addition, the dataset has three other variables (χ1, χ2, and χ3; columns X1 to X3), which will be used in future analyses but are not needed to code γ_comp. Here is an excerpt of the data.frame:

df <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 
                    Y1_t1 = c(5, 6, 10, 7, 5, 7, 5, 4, 7, 4), 
                    Y2_t1 = c(6, 4, 8, 8, 7, 10, 7, 6, 5, 7), 
                    Y3_t1 = c(5, 6, 10, 4, 8, 5, 10, 5, 4, 6), 
                    Y4_t1 = c(4.5, 8.5, 9.5, 4.5, 5, 8, 4.5, 8.5, 4, 6), 
                    Y1_t2 = c(6, 4, 5, 5, 3, 4, 8, 4, 3, 2), 
                    Y2_t2 = c(5, 4, 3, 6, 5, 5, 5, 2, 2, 8), 
                    Y3_t2 = c(2, 2, 4, 5, 4, 9, 5, 3, 2, 4), 
                    Y4_t2 = c(3.5, 6, 5, 5, 4.5, 4, 2.5, 7, 4.5, 4), 
                    X1 = c(40, 45, 52, 44, 42, 65, 55, 61, 52, 49), 
                    X2 = c("NL", "UK", "NL", "US", "UK", "US", "NL", "NL", "UK", "UK"), 
                    X3 = c(2000, 2005, 2003, 2000, 2001, 2002, 2003, 2004, 2001, 2000)), 
                    class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L))

Structure

spec_tbl_df [10 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID   : num [1:10] 1 2 3 4 5 6 7 8 9 10
 $ Y1_t1: num [1:10] 5 6 10 7 5 7 5 4 7 4
 $ Y2_t1: num [1:10] 6 4 8 8 7 10 7 6 5 7
 $ Y3_t1: num [1:10] 5 6 10 4 8 5 10 5 4 6
 $ Y4_t1: num [1:10] 4.5 8.5 9.5 4.5 5 8 4.5 8.5 4 6
 $ Y1_t2: num [1:10] 6 4 5 5 3 4 8 4 3 2
 $ Y2_t2: num [1:10] 5 4 3 6 5 5 5 2 2 8
 $ Y3_t2: num [1:10] 2 2 4 5 4 9 5 3 2 4
 $ Y4_t2: num [1:10] 3.5 6 5 5 4.5 4 2.5 7 4.5 4
 $ X1   : num [1:10] 40 45 52 44 42 65 55 61 52 49
 $ X2   : chr [1:10] "NL" "UK" "NL" "US" ...
 $ X3   : num [1:10] 2000 2005 2003 2000 2001 ...

As mentioned earlier, I am interested in calculating γ_comp. The rules for recording are as follows:

Three out of the four components (γ1, ..., γ4) must show more than 20% improvement (i.e., a decrease) on the numeric scale (0-10, higher is worse) at t2 compared to t1.
In the remaining component, there should be no worsening of more than 20% at t2 compared to t1.

I believe the following steps have to be taken to achieve this aim. First, the ratio Y1_diff = Y1_t2/Y1_t1 must be calculated, and likewise for every other component. This is the ratio between the two time points, and it should be <0.80 for a component to count as improved. Next, an if_else condition has to be applied that enforces these rules and returns 1 if the rules are met and 0 if not (i.e., "responded" to treatment or not).
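As a worked example of the steps above, here is a minimal base R sketch for a single participant (ID 1 in the excerpt). The names t1, t2, ratio, n_improved, no_bad_worsening and ycomp are illustrative only and are not columns of the dataset.

```r
# Scores for ID 1, copied from the example data (Y1..Y4 at t1 and t2)
t1 <- c(Y1 = 5, Y2 = 6, Y3 = 5, Y4 = 4.5)
t2 <- c(Y1 = 6, Y2 = 5, Y3 = 2, Y4 = 3.5)

# Ratio between the two time points; below 0.8 means more than 20% improvement
ratio <- t2 / t1

# Number of components that improved by more than 20%
n_improved <- sum(ratio < 0.8)

# TRUE if no component worsened by more than 20%
no_bad_worsening <- all(ratio <= 1.2)

# Responder (1) only if at least three components improved
# and no component worsened by more than 20%
ycomp <- as.integer(n_improved >= 3 && no_bad_worsening)

round(ratio, 2)
ycomp
```

For this participant only two of the four ratios fall below 0.8, so the sketch classifies them as a non-responder, in line with the first row of the desired output.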

For example, this could be a desired output:

      ID Ycomp Y1_t1 Y2_t1 Y3_t1 Y4_t1 Y1_t2 Y2_t2 Y3_t2 Y4_t2 Y1_diff Y2_diff Y3_diff Y4_diff    X1 X2       X3
 1     1     0     5     6     5   4.5     6     5     2   3.5    1.2     0.83    0.4     0.78    40 NL     2000
 2     2     1     6     4     6   8.5     4     4     2   6      0.67    1       0.33    0.71    45 UK     2005
 3     3     1    10     8    10   9.5     5     3     4   5      0.5     0.38    0.4     0.53    52 NL     2003
 4     4     0     7     8     4   4.5     5     6     5   5      0.71    0.75    1.25    1.11    44 US     2000
 5     5     1     5     7     8   5       3     5     4   4.5    0.6     0.71    0.5     0.9     42 UK     2001
 6     6     0     7    10     5   8       4     5     9   4      0.57    0.5     1.8     0.5     65 US     2002
 7     7     0     5     7    10   4.5     8     5     5   2.5    1.6     0.71    0.5     0.56    55 NL     2003
 8     8     0     4     6     5   8.5     4     2     3   7      1       0.33    0.6     0.82    61 NL     2004
 9     9     1     7     5     4   4       3     2     2   4.5    0.43    0.4     0.5     1.13    52 UK     2001
10    10     1     4     7     6   6       2     8     4   4      0.5     1.14    0.67    0.67    49 UK     2000

I would appreciate any advice on recoding the composite score γ_comp. Alternative methods are also welcome. The idea is to use γ_comp in logistic regression in future analysis.

微凉徒眸意 2025-02-01 14:19:49

This should do it for you, assuming my understanding is correct:

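library(dplyr)
library(tidyr)

inner_join(
  df,
  df %>%
    select(ID, starts_with("Y")) %>%
    # long format: one row per ID, component (Y) and time point (t)
    pivot_longer(!ID, names_to = c("Y", "t"), names_sep = "_") %>%
    # back to one row per ID and component, with columns t1 and t2
    pivot_wider(id_cols = ID:Y, names_from = t, values_from = value) %>%
    mutate(change = 1 - t2 / t1) %>%      # positive change = improvement
    group_by(ID) %>%
    mutate(impct = sum(change > 0.2)) %>% # components improved by more than 20%
    summarize(Y_comp = 1 * all(impct == 4 | (impct == 3 & min(change) >= -0.2))),
  by = "ID"
) %>% relocate(Y_comp, .after = ID)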

Output:

      ID Y_comp Y1_t1 Y2_t1 Y3_t1 Y4_t1 Y1_t2 Y2_t2 Y3_t2 Y4_t2    X1 X2       X3
   <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
 1     1      0     5     6     5   4.5     6     5     2   3.5    40 NL     2000
 2     2      1     6     4     6   8.5     4     4     2   6      45 UK     2005
 3     3      1    10     8    10   9.5     5     3     4   5      52 NL     2003
 4     4      0     7     8     4   4.5     5     6     5   5      44 US     2000
 5     5      1     5     7     8   5       3     5     4   4.5    42 UK     2001
 6     6      0     7    10     5   8       4     5     9   4      65 US     2002
 7     7      0     5     7    10   4.5     8     5     5   2.5    55 NL     2003
 8     8      0     4     6     5   8.5     4     2     3   7      61 NL     2004
 9     9      1     7     5     4   4       3     2     2   4.5    52 UK     2001
10    10      1     4     7     6   6       2     8     4   4      49 UK     2000

Explanation:

This is an inner join between df and a new data frame that contains two columns, ID and Y_comp. How is this second frame created?

I select the column ID and the columns starting with "Y".
I pivot long, and then pivot wide, to get the data into a format with four columns (ID, Y, t1, and t2), one row per ID and component.
On each row, I compute the change as 1-t2/t1, so a positive value is an improvement.
For each ID (group_by(ID)), I generate a column impct as the number of components whose change exceeds 0.2. This is constant within each ID.
For each ID, I define Y_comp as TRUE if impct==4 (i.e., all four components improved), or if impct==3 and the minimum change in the set is not less than -0.2 (i.e., the remaining component did not worsen by more than 20%).
I multiply by 1 in that same line to convert Y_comp to numeric 1/0 rather than TRUE/FALSE.
After the join is completed, I move Y_comp after ID using relocate().
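The same counting logic can also be sketched without reshaping, using base R matrices. This is an alternative illustration, not the method of the answer above; the names df_y, components, t1, t2, change, impct and worst are hypothetical, and only the Y columns of the question's data are reproduced.

```r
# Example data from the question (only the columns needed for the rule)
df_y <- data.frame(
  ID    = 1:10,
  Y1_t1 = c(5, 6, 10, 7, 5, 7, 5, 4, 7, 4),
  Y2_t1 = c(6, 4, 8, 8, 7, 10, 7, 6, 5, 7),
  Y3_t1 = c(5, 6, 10, 4, 8, 5, 10, 5, 4, 6),
  Y4_t1 = c(4.5, 8.5, 9.5, 4.5, 5, 8, 4.5, 8.5, 4, 6),
  Y1_t2 = c(6, 4, 5, 5, 3, 4, 8, 4, 3, 2),
  Y2_t2 = c(5, 4, 3, 6, 5, 5, 5, 2, 2, 8),
  Y3_t2 = c(2, 2, 4, 5, 4, 9, 5, 3, 2, 4),
  Y4_t2 = c(3.5, 6, 5, 5, 4.5, 4, 2.5, 7, 4.5, 4)
)

components <- paste0("Y", 1:4)

# One matrix per time point, with columns in the same component order
t1 <- as.matrix(df_y[paste0(components, "_t1")])
t2 <- as.matrix(df_y[paste0(components, "_t2")])

# Relative change per participant and component (positive = improvement)
change <- 1 - t2 / t1

# Number of components improved by more than 20%, per participant
impct <- rowSums(change > 0.2)

# Worst (most negative) change per participant
worst <- apply(change, 1, min)

# 1 if all four improved, or three improved and the rest did not worsen by >20%
df_y$Y_comp <- as.numeric(impct == 4 | (impct == 3 & worst >= -0.2))

df_y[c("ID", "Y_comp")]
```

On the example data this gives the same Y_comp values as the output shown above (0, 1, 1, 0, 1, 0, 0, 0, 1, 1). The matrix form avoids the pivot steps but assumes the t1 and t2 columns follow the Yk_t1 / Yk_t2 naming pattern exactly.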

Update:

The OP reported an error, likely caused by a namespace collision; one solution is to be explicit about the package each function comes from:

library(magrittr)  # provides the %>% pipe
dplyr::inner_join(
  df,
  df %>%
    dplyr::select(ID, dplyr::starts_with("Y")) %>%
    tidyr::pivot_longer(!ID, names_to = c("Y", "t"), names_sep = "_") %>%
    tidyr::pivot_wider(id_cols = ID:Y, names_from = t, values_from = value) %>%
    dplyr::mutate(change = 1 - t2 / t1) %>%
    dplyr::group_by(ID) %>%
    dplyr::mutate(impct = sum(change > 0.2)) %>%
    dplyr::summarize(Y_comp = 1 * all(impct == 4 | (impct == 3 & min(change) >= -0.2))),
  by = "ID"
) %>% dplyr::relocate(Y_comp, .after = ID)
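To see whether a name is defined in more than one attached package, base R offers find() and conflicts(). The sketch below uses "filter" purely as an illustration, because it exists in the stats package of every R session; the thread does not say which function was actually masked in the OP's case.

```r
# find() lists every attached package or environment that defines a name.
# In a fresh session "filter" comes from stats; with dplyr attached it
# would be listed under both packages, and the first entry is the one
# an unqualified call resolves to.
find("filter")

# conflicts() reports names that are defined in more than one place
# on the search path
head(conflicts(detail = FALSE))
```

If a name used in the pipeline shows up under more than one package, prefixing it with pkg:: as in the update above removes the ambiguity.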


一场信仰旅途 2025-02-01 14:19:49

Inspired by the method of langtang, I found one possible solution to the problem:

library(dplyr)

# Score each component: +1 improved by >20%, 0 in between, -1 worsened by >20%
df <- df %>%
  mutate(Y1_diff =
           case_when(Y1_t2 / Y1_t1 < 0.8 ~ 1,
                     Y1_t2 == 0 ~ 0,   # catches 0/0, where the ratio is NaN
                     Y1_t2 / Y1_t1 >= 0.8 & Y1_t2 / Y1_t1 <= 1.2 ~ 0,
                     TRUE ~ -1)) %>%
  mutate(Y2_diff =
           case_when(Y2_t2 / Y2_t1 < 0.8 ~ 1,
                     Y2_t2 == 0 ~ 0,
                     Y2_t2 / Y2_t1 >= 0.8 & Y2_t2 / Y2_t1 <= 1.2 ~ 0,
                     TRUE ~ -1)) %>%
  mutate(Y3_diff =
           case_when(Y3_t2 / Y3_t1 < 0.8 ~ 1,
                     Y3_t2 == 0 ~ 0,
                     Y3_t2 / Y3_t1 >= 0.8 & Y3_t2 / Y3_t1 <= 1.2 ~ 0,
                     TRUE ~ -1)) %>%
  mutate(Y4_diff =
           case_when(Y4_t2 / Y4_t1 < 0.8 ~ 1,
                     Y4_t2 == 0 ~ 0,
                     Y4_t2 / Y4_t1 >= 0.8 & Y4_t2 / Y4_t1 <= 1.2 ~ 0,
                     TRUE ~ -1)) %>%
  # Responder if the four scores sum to at least 3
  mutate(Ycomp =
           case_when(Y1_diff + Y2_diff + Y3_diff + Y4_diff >= 3 ~ 1,
                     TRUE ~ 0))

Explanation

I first create four variables (Yn_diff) that assess whether the ratio t2/t1 is below 0.8 (i.e., improved by more than 20%), between 0.8 and 1.2, or above 1.2 (i.e., worsened by more than 20%). These are coded +1 for improvement, 0 if in between, and -1 for worsening. I also check whether the value at t2 is zero and, if the first condition has not already matched, give it a score of 0, because in my real dataset there were scenarios where both t1 and t2 were 0, which gives NaN (0/0). Finally, I add up the four variables; a sum of at least 3 gives Ycomp = 1, which yields the correct output.
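The per-component scoring described above, including the 0/0 case, can be sketched as a small base R helper. The function name score_component and its arguments are hypothetical and do not appear in the code above; unlike the case_when() version, this sketch leaves genuinely missing values as NA instead of scoring them -1.

```r
# Score one component: +1 if improved by more than 20%, 0 if the ratio lies
# between 0.8 and 1.2, -1 if worsened by more than 20%.
score_component <- function(t1, t2) {
  ratio <- t2 / t1

  # Nested ifelse mirrors the three bands; NaN or NA ratios propagate as NA
  score <- ifelse(ratio < 0.8, 1, ifelse(ratio <= 1.2, 0, -1))

  # 0/0 produces NaN; treat "zero at both time points" as no change
  score[is.nan(ratio)] <- 0
  score
}

0 / 0  # NaN, which is why the special case is needed

# Four illustrative pairs: stable, both zero, improved, worsened
score_component(t1 = c(5, 0, 10, 4), t2 = c(6, 0, 5, 8))
```

Applying such a helper to each of the four component pairs and summing the results would reproduce the Ycomp rule (sum of at least 3) without repeating the same case_when() block four times.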

Output

      ID Ycomp Y1_t1 Y1_t2 Y2_t1 Y2_t2 Y3_t1 Y3_t2 Y4_t1 Y4_t2
 1     1     0     5     6     6     5     5     2   4.5   3.5
 2     2     1     6     4     4     4     6     2   8.5   6  
 3     3     1    10     5     8     3    10     4   9.5   5  
 4     4     0     7     5     8     6     4     5   4.5   5  
 5     5     1     5     3     7     5     8     4   5     4.5
 6     6     0     7     4    10     5     5     9   8     4  
 7     7     0     5     8     7     5    10     5   4.5   2.5
 8     8     0     4     4     6     2     5     3   8.5   7  
 9     9     1     7     3     5     2     4     2   4     4.5
10    10     1     4     2     7     8     6     4   6     4
