Number of coinciding combinations between groups, used afterwards to generate a matrix
I have a data frame like df:
library(tibble)  # tibble() replaces the deprecated data_frame()
id <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C")
type <- c(1, 4, 3, 6, 1, 4, 5, 2, 3, 6)
df <- tibble(id, type)
and I want to count the combinations of type that occur together within each id.
Afterwards, I want to use these counts to generate a symmetric matrix (A), where each cell is the number of ids in which both types occur:
A <- matrix(
  # Expected counts, written one matrix row per line
  c(NA, 0, 1, 2, 1, 1,
    0, NA, 1, 0, 0, 1,
    1, 1, NA, 1, 0, 2,
    2, 0, 1, NA, 1, 1,
    1, 0, 0, 1, NA, 0,
    1, 1, 2, 1, 0, NA),
  nrow = 6,
  ncol = 6,
  # matrix() fills column-wise by default; byrow = TRUE fills row by row
  byrow = TRUE
)
# Name rows and columns after the types
rownames(A) <- paste("Type", 1:6)
colnames(A) <- paste("Type", 1:6)
cat("Number of coincidences between Type by id\n")
print(A)
My attempt so far is the following:
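library(dplyr)  # %>%, mutate(), filter()
library(purrr)  # map2_dbl()
library(tidyr)  # spread()
# All ordered pairs of types, so the reshaped output is symmetric
intermediate_step <- expand.grid(Variety1 = unique(df$type),
                                 Variety2 = unique(df$type),
                                 stringsAsFactors = FALSE) %>%
  # For each pair, count the ids that contain both types
  mutate(counts = map2_dbl(Variety1, Variety2,
                           ~ length(intersect(df$id[df$type == .x],
                                              df$id[df$type == .y])))) %>%
  filter(Variety1 != Variety2)  # drop a type paired with itself
AA <- spread(intermediate_step, Variety2, counts)  # long to wide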
However, two big problems arise:
- intermediate_step does not compute the counts correctly.
- This method is computationally very expensive. It works for this toy example, but for my real data (93k entries) RStudio aborts the session.
A possible solution to the second problem:
- Create a loop that works as follows. It takes the first element of the variable Type (e.g. 1) and eliminates all ids that do not contain it (here C). Then it builds a smaller matrix from the information in the two remaining ids (A and B in this example). The algorithm then repeats this step for every Type element that has not been selected yet.
Any clue on how to perform the analysis in a more computationally efficient way, or on how to apply my proposed solution?
Thank you :)
It seems your data is incorrect.
With the correct data, i.e. row 2, column 2 should be 4 and not 2:
df[2, 2] <- 4
you could do:
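One possible way, sketched here under assumptions rather than as this answer's definitive code, is to build an id-by-type incidence table with base R's table() and multiply it by its own transpose with crossprod(). The names inc and co are made up for illustration, and data.frame() is used instead of a tibble so that no packages are needed.

```r
# Toy data from the question (base R data frame, so no packages are required)
id   <- c("A", "A", "A", "A", "B", "B", "B", "C", "C", "C")
type <- c(1, 4, 3, 6, 1, 4, 5, 2, 3, 6)
df   <- data.frame(id, type)

# Incidence table: one row per id, one column per type.
# unique() guards against duplicated (id, type) rows being counted twice,
# so every entry is 0 or 1.
inc <- table(unique(df[, c("id", "type")]))

# t(inc) %*% inc: entry [i, j] is the number of ids that contain both types
co <- crossprod(inc)

# The diagonal holds how many ids contain each type; the question wants NA there
diag(co) <- NA

# Label rows and columns the same way as the matrix A in the question
dimnames(co) <- list(paste("Type", colnames(inc)),
                     paste("Type", colnames(inc)))
print(co)
```

On the toy data this should reproduce the matrix A from the question. It replaces the pairwise intersect() calls with a single matrix product, which is usually far cheaper. For the real data (93k entries) a dense id-by-type table may still be too large to hold in memory; a sparse incidence matrix, for example from the Matrix package, would be the natural next step, but that is untested here.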