与之相关

发布于 2025-01-25 02:35:29 字数 431 浏览 2 评论 0原文

DATA = data.frame("GROUP" = sort(rep(1:4, 200)),
                  "TYPE" = rep(1:2, 400),
                  "TIME" = rep(100:101, 400),
                  "SCORE" = sample(1:100,r=T,800))

欢呼,

我有“数据”,并希望以这种方式估算每个时间的分数和分数和类型组合的分数和类型组合的相关值:

DATA = data.frame("GROUP" = sort(rep(1:4, 200)),
                  "TYPE" = rep(1:2, 400),
                  "TIME" = rep(100:101, 400),
                  "SCORE" = sample(1:100,r=T,800))

Cheers all,

I have 'DATA' and wish to estimation the CORRELATION VALUES of SCORE at each TIME and SCORE and TYPE combination BETWEEN AND WITHIN GROUP in this way:
enter image description here

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

ゞ花落谁相伴 2025-02-01 02:35:30

我假设您想计算每种时间和类型组合的组1-2、1-3、1-4之类的相关性。这是一种方法:

# create the dataset
set.seed(123)
df <- data.frame("group" = sort(rep(1:4, 200)),
                 "type" = rep(1:2, 400),
                 "time" = rep(100:101, 400),
                 "score" = sample(1:100,r=T,800))
library(tidyr)
library(purrr)
library(data.table)

# another dataset to filter combinations 
# (G1G2 is same G2G1, so remove G2G1)
df2 <- combn(4, 2) %>% t %>% 
    as_tibble() %>% 
    rename(group1 = V1, group2 = V2) %>% 
    mutate(value = TRUE)

df %>% 
    
    # add identifiers per group
    group_by(time, type, group) %>% 
    mutate(id = row_number()) %>% 
    ungroup() %>% 
    
    # nest data to get separate tibble for each
    # combination of time and type
    nest(data = -c(time, type)) %>% 
    
    # convert each data.frame to data.table
    mutate(dt = map(data, function(dt){
        setDT(dt)
        setkey(dt, id)
        dt
    })) %>% 
    
    # correlation between groups in R
    # refer answer below for more details
    # https://stackoverflow.com/a/26357667/15221658
    
    # cartesian join of dts
    mutate(dtj = map(dt, ~.[., allow.cartesian = TRUE])) %>% 
    
    # compute between group correlation
    mutate(cors = map(dtj, ~.[, list(cors = cor(score, i.score)), by = list(group, i.group)])) %>% 
    
    # unnest correlation object
    unnest(cors) %>% 
    
    # formatting for display
    select(type, time, group1 = group, group2 = i.group, correlation = cors) %>% 
    filter(group1 != group2) %>%
    arrange(time, group1, group2) %>% 

    # now use df2 since currently we have G1G2, and G2G1
    # which are both equal so remove G2G1
    left_join(df2, by = c("group1", "group2")) %>% 
    filter(value) %>% 
    select(-value)

# A tibble: 12 x 5
    type  time group1 group2 correlation
   <int> <int>  <int>  <int>       <dbl>
 1     1   100      1      2     0.121  
 2     1   100      1      3     0.0543 
 3     1   100      1      4    -0.0694 
 4     1   100      2      3    -0.104  
 5     1   100      2      4    -0.0479 
 6     1   100      3      4    -0.0365 
 7     2   101      1      2    -0.181  
 8     2   101      1      3    -0.0673 
 9     2   101      1      4     0.00765
10     2   101      2      3     0.0904 
11     2   101      2      4    -0.0126 
12     2   101      3      4    -0.154 
    

I am assuming you want to compute the correlation between groups 1-2, 1-3, 1-4 and so on for each combination of TIME and TYPE. Here's an approach:

# create the dataset
set.seed(123)
df <- data.frame("group" = sort(rep(1:4, 200)),
                 "type" = rep(1:2, 400),
                 "time" = rep(100:101, 400),
                 "score" = sample(1:100,r=T,800))
library(tidyr)
library(purrr)
library(data.table)

# another dataset to filter combinations 
# (G1G2 is same G2G1, so remove G2G1)
df2 <- combn(4, 2) %>% t %>% 
    as_tibble() %>% 
    rename(group1 = V1, group2 = V2) %>% 
    mutate(value = TRUE)

df %>% 
    
    # add identifiers per group
    group_by(time, type, group) %>% 
    mutate(id = row_number()) %>% 
    ungroup() %>% 
    
    # nest data to get separate tibble for each
    # combination of time and type
    nest(data = -c(time, type)) %>% 
    
    # convert each data.frame to data.table
    mutate(dt = map(data, function(dt){
        setDT(dt)
        setkey(dt, id)
        dt
    })) %>% 
    
    # correlation between groups in R
    # refer answer below for more details
    # https://stackoverflow.com/a/26357667/15221658
    
    # cartesian join of dts
    mutate(dtj = map(dt, ~.[., allow.cartesian = TRUE])) %>% 
    
    # compute between group correlation
    mutate(cors = map(dtj, ~.[, list(cors = cor(score, i.score)), by = list(group, i.group)])) %>% 
    
    # unnest correlation object
    unnest(cors) %>% 
    
    # formatting for display
    select(type, time, group1 = group, group2 = i.group, correlation = cors) %>% 
    filter(group1 != group2) %>%
    arrange(time, group1, group2) %>% 

    # now use df2 since currently we have G1G2, and G2G1
    # which are both equal so remove G2G1
    left_join(df2, by = c("group1", "group2")) %>% 
    filter(value) %>% 
    select(-value)

# A tibble: 12 x 5
    type  time group1 group2 correlation
   <int> <int>  <int>  <int>       <dbl>
 1     1   100      1      2     0.121  
 2     1   100      1      3     0.0543 
 3     1   100      1      4    -0.0694 
 4     1   100      2      3    -0.104  
 5     1   100      2      4    -0.0479 
 6     1   100      3      4    -0.0365 
 7     2   101      1      2    -0.181  
 8     2   101      1      3    -0.0673 
 9     2   101      1      4     0.00765
10     2   101      2      3     0.0904 
11     2   101      2      4    -0.0126 
12     2   101      3      4    -0.154 
    
給妳壹絲溫柔 2025-02-01 02:35:30

这是一种替代方法,它可以创建timetype的所有唯一组合,并通过交叉加入 group group s group >然后计算得分的相关性 data的对应子集:

library(data.table) # development version 1.14.3 required
setDT(DATA, key = c("GROUP", "TYPE", "TIME"))[
  , CJ(time = TIME, type = TYPE, groupA = GROUP, groupB = GROUP, unique = TRUE)][
    groupA < groupB][
      , corType := paste0("G", groupA, "G", groupB)][][
        , corValue := cor(DATA[.(groupA, type, time), SCORE], 
                          DATA[.(groupB, type, time), SCORE]), 
        by = .I][]
 时间类型groupa groupb cortype corvalue
 1:100 1 1 2 G1G2 0.11523940
 2:100 1 1 3 G1G3 -0.05124326
 3:100 1 1 4 G1G4 -0.16943203
 4:100 1 2 3 G2G3 0.05475435
 5:100 1 2 4 G2G4 -0.10769738
 6:100 1 3 4 G3G4 0.01464146
 7:100 2 1 2 g1g2 na
 8:100 2 1 3 g1g3 na
 9:100 2 1 4 g1g4 na
10:100 2 2 3 G2G3 Na
11:100 2 2 4 G2G4 NA
12:100 2 3 4 G3G4 NA
13:101 1 1 2 g1g2 na
14:101 1 1 3 g1g3 na
15:101 1 1 4 g1g4 na
16:101 1 2 3 g2g3 na
17:101 1 2 4 G2G4 NA
18:101 1 3 4 G3G4 NA
19:101 2 1 2 G1G2 -0.04997479
20:101 2 1 3 G1G3 -0.02262932
21:101 2 1 4 G1G4 -0.00331578
22:101 2 2 3 G2G3 -0.01243952
23:101 2 2 4 G2G4 0.16683223
24:101 2 3 4 G3G4 -0.10556083
    时间类型group groupb cortype corvalue
 

说明

  1. <代码>数据被胁迫到类data.table在列上设置键grouptype和<代码>时间。 键盘以后快速子集是必需的。
  2. 交叉加入 cj()创建列的所有唯一组合timetype> typegroup 和(两次)。 交叉加入的列已重命名,以避免以后的名称冲突。
  3. [Groupa&lt; groupB]确保groupagroupb仅出现一次的等效组合,例如,g2g1被删除以支持G1G2。因此,这是一种数据。
  4. 新列cortype通过参考附加
  5. 最后,通过通过交叉加入表(使用by = .i)和子集data by 来计算组的相关性。 groupa,<代码>类型,timegroupbtypetime ,分别使用使用快速子集。请,请参阅vignette norofollow noreferrer> 有关更多详细信息。
    请注意,by = .idata.table开发版本1.14.3的新功能。

data中不存在的时间,类型和组的组合将出现在结果集中,但以na在列中标记为corvalue

数据

set.seed(42) # required for reproducible data
DATA = data.frame("GROUP" = sort(rep(1:4, 200)),
                  "TYPE" = rep(1:2, 400),
                  "TIME" = rep(100:101, 400),
                  "SCORE" = sample(1:100, r=T, 800))

Here is an alternative approach which creates all unique combinations of TIME, TYPE, and duplicated GROUPs through a cross join and then computes the correlation of SCORE for the correspondings subsets of DATA:

library(data.table) # development version 1.14.3 required
setDT(DATA, key = c("GROUP", "TYPE", "TIME"))[
  , CJ(time = TIME, type = TYPE, groupA = GROUP, groupB = GROUP, unique = TRUE)][
    groupA < groupB][
      , corType := paste0("G", groupA, "G", groupB)][][
        , corValue := cor(DATA[.(groupA, type, time), SCORE], 
                          DATA[.(groupB, type, time), SCORE]), 
        by = .I][]
    time type groupA groupB corType    corValue
 1:  100    1      1      2    G1G2  0.11523940
 2:  100    1      1      3    G1G3 -0.05124326
 3:  100    1      1      4    G1G4 -0.16943203
 4:  100    1      2      3    G2G3  0.05475435
 5:  100    1      2      4    G2G4 -0.10769738
 6:  100    1      3      4    G3G4  0.01464146
 7:  100    2      1      2    G1G2          NA
 8:  100    2      1      3    G1G3          NA
 9:  100    2      1      4    G1G4          NA
10:  100    2      2      3    G2G3          NA
11:  100    2      2      4    G2G4          NA
12:  100    2      3      4    G3G4          NA
13:  101    1      1      2    G1G2          NA
14:  101    1      1      3    G1G3          NA
15:  101    1      1      4    G1G4          NA
16:  101    1      2      3    G2G3          NA
17:  101    1      2      4    G2G4          NA
18:  101    1      3      4    G3G4          NA
19:  101    2      1      2    G1G2 -0.04997479
20:  101    2      1      3    G1G3 -0.02262932
21:  101    2      1      4    G1G4 -0.00331578
22:  101    2      2      3    G2G3 -0.01243952
23:  101    2      2      4    G2G4  0.16683223
24:  101    2      3      4    G3G4 -0.10556083
    time type groupA groupB corType    corValue

Explanation

  1. DATA is coerced to class data.table while setting a key on columns GROUP, TYPE, and TIME. Keying is required for fast subsetting later.
  2. The cross join CJ() creates all unique combinations of columns TIME, TYPE, GROUP, and GROUP (twice). The columns of the cross join have been renamed to avoid name clashes later on.
  3. [groupA < groupB] ensures that equivalent combinations of groupA and groupB only appear once, e.g., G2G1 is dropped in favour of G1G2. So, this is kind of data.table version of t(combn(unique(DATA$GROUP), 2)).
  4. A new column corType is append by reference.
  5. Finally, the groupwise correlations are computed by stepping rowwise through the cross join table (using by = .I) and subsetting DATA by groupA, type, time and groupB, type, time, resp., using fast subsetting through keys. Please, see the vignette Keys and fast binary search based subset for more details.
    Note that by = .I is a new feature of data.table development version 1.14.3.

Combinations of time, type, and group which do not exist in DATA will appear in the result set but are marked by NA in column corValue.

Data

set.seed(42) # required for reproducible data
DATA = data.frame("GROUP" = sort(rep(1:4, 200)),
                  "TYPE" = rep(1:2, 400),
                  "TIME" = rep(100:101, 400),
                  "SCORE" = sample(1:100, r=T, 800))
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文