Finding the optimal subset of rows and the optimal subset of columns based on marginal means
For a longitudinal dataset, I have a binary matrix (df) indicating whether data is available for cases (rows) across time points (columns). I would like to find the optimal subset in which at least 2/3 of each row and each column == 1. The issue I run into is that the two criteria depend on one another (i.e., the set of columns for which at least 2/3 of the rows have data available changes as soon as I remove a row that has less than 2/3 of its data available, and vice versa).
# data structure example:
library(data.table)
set.seed(42)
df <- as.data.table(matrix(rbinom(10*5, 1, .66), ncol = 10, nrow = 5))
df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 0 1 1 0 0 1 0 0 1 0
2: 0 0 0 0 1 1 0 1 1 0
3: 1 1 0 1 0 0 1 1 1 1
4: 0 1 1 1 0 1 0 0 0 0
5: 1 0 1 1 1 0 1 1 1 1
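To make the 2/3 criterion concrete, this is the kind of check I have in mind for any candidate subset (a small helper written purely for illustration, not part of my actual code):

# purely illustrative helper: does a candidate subset of rows and columns
# satisfy the 2/3 rule in both directions?
feasible <- function(dt, rows, cols) {
  sub <- as.matrix(dt)[rows, cols, drop = FALSE]
  all(rowMeans(sub) >= 2/3) && all(colMeans(sub) >= 2/3)
}
feasible(df, rows = c(3, 5), cols = c(4, 7, 8, 9))  # example call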
From the tags of similar questions, this seems like an integer programming problem or a multi-objective optimization problem. Unfortunately, I am not super familiar with either of those approaches. Intuitively, I would like to simultaneously maximize rowMeans and colMeans under a >= .66 constraint, but I'm not sure whether that is really the most productive approach here.
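From what I have read, the problem might be expressible as a mixed-integer program with one binary "keep" variable per row and per column. Below is my rough, untested sketch of such a formulation using the ompr modelling package with the GLPK solver; the package choice, the big-M constraints, and the objective (maximize the number of available observations in the retained sub-matrix) are all my own assumptions:

library(magrittr)           # for %>%
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

A <- as.matrix(df)
n <- nrow(A)
m <- ncol(A)

model <- MIPModel() %>%
  # keep_r[i] = 1 if row i is kept, keep_c[j] = 1 if column j is kept
  add_variable(keep_r[i], i = 1:n, type = "binary") %>%
  add_variable(keep_c[j], j = 1:m, type = "binary") %>%
  # z[i, j] = 1 only if both row i and column j are kept (linearised product)
  add_variable(z[i, j], i = 1:n, j = 1:m, type = "binary") %>%
  add_constraint(z[i, j] <= keep_r[i], i = 1:n, j = 1:m) %>%
  add_constraint(z[i, j] <= keep_c[j], i = 1:n, j = 1:m) %>%
  # every kept row needs >= 2/3 ones within the kept columns (big-M switch)
  add_constraint(sum_over(A[i, j] * keep_c[j], j = 1:m) >=
                   (2/3) * sum_over(keep_c[j], j = 1:m) - m * (1 - keep_r[i]),
                 i = 1:n) %>%
  # every kept column needs >= 2/3 ones within the kept rows
  add_constraint(sum_over(A[i, j] * keep_r[i], i = 1:n) >=
                   (2/3) * sum_over(keep_r[i], i = 1:n) - n * (1 - keep_c[j]),
                 j = 1:m) %>%
  # maximise the amount of available (== 1) data in the retained sub-matrix
  set_objective(sum_over(A[i, j] * z[i, j], i = 1:n, j = 1:m), sense = "max")

result <- solve_model(model, with_ROI(solver = "glpk"))
sol_r  <- get_solution(result, keep_r[i])
sol_c  <- get_solution(result, keep_c[j])
keep_rows <- sol_r$i[sol_r$value > 0.5]
keep_cols <- sol_c$j[sol_c$value > 0.5]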
So far, I have tried to adapt the approaches of similar questions. With a brute-force approach based on this thread, I was able to find optimal row and column subsets separately:
best.list.row.df = list()
for (i in 1:nrow(df)) {
  # get best subset for rows based on how many columns have more than 66% data
  rowlist = combn(nrow(df), i)
  numobs = apply(rowlist, 2, function(x) sum(colMeans(df[x,])*100 >= 66))
  cat("For subsets of", i, "rows, the highest number of observations is",
      max(numobs), "out of the", ncol(df), "maximum. Product =", i*max(numobs), "\n")
  best = which(numobs == max(numobs))[1]
  best.list.row.df = c(best.list.row.df, list(rowlist[, best]))
}
> For subsets of 1 rows, the highest number of observations is 8 out of the 10 maximum. Product = 8
> For subsets of 2 rows, the highest number of observations is 6 out of the 10 maximum. Product = 12
> For subsets of 3 rows, the highest number of observations is 8 out of the 10 maximum. Product = 24
> For subsets of 4 rows, the highest number of observations is 4 out of the 10 maximum. Product = 16
> For subsets of 5 rows, the highest number of observations is 1 out of the 10 maximum. Product = 5
best.list.col.df = list()
for (i in 1:ncol(df)) {
  # get best subset for columns based on how many rows have more than 66% data
  collist = combn(ncol(df), i)
  numobs = apply(collist, 2, function(x) sum(rowMeans(df[, ..x])*100 >= 66))
  cat("For subsets of", i, "columns, the highest number of participants is",
      max(numobs), "out of the", nrow(df), "maximum. Product =", i*max(numobs), "\n")
  best = which(numobs == max(numobs))[1]
  best.list.col.df = c(best.list.col.df, list(collist[, best]))
}
> For subsets of 1 columns, the highest number of participants is 4 out of the 5 maximum. Product = 4
> For subsets of 2 columns, the highest number of participants is 3 out of the 5 maximum. Product = 6
> For subsets of 3 columns, the highest number of participants is 5 out of the 5 maximum. Product = 15
> For subsets of 4 columns, the highest number of participants is 4 out of the 5 maximum. Product = 16
> For subsets of 5 columns, the highest number of participants is 2 out of the 5 maximum. Product = 10
> For subsets of 6 columns, the highest number of participants is 4 out of the 5 maximum. Product = 24
> For subsets of 7 columns, the highest number of participants is 2 out of the 5 maximum. Product = 14
> For subsets of 8 columns, the highest number of participants is 2 out of the 5 maximum. Product = 16
> For subsets of 9 columns, the highest number of participants is 2 out of the 5 maximum. Product = 18
> For subsets of 10 columns, the highest number of participants is 2 out of the 5 maximum. Product = 20
Based on these results, I would select the three-row and the six-column solutions, as those individually give me the most valid data (the highest product of subset size and rows/columns meeting the criterion).
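To make that selection concrete, the two winning index sets can be pulled out of the lists built above and cross-checked against each other (purely as an illustration; it already hints at the interdependence described below):

rows <- best.list.row.df[[3]]   # indices of the best 3-row subset
cols <- best.list.col.df[[6]]   # indices of the best 6-column subset
sub  <- df[rows, ..cols]
# combining the two separately optimised subsets does not guarantee that
# either criterion still holds
all(rowMeans(sub) >= 2/3)
all(colMeans(sub) >= 2/3)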
The problems with this approach are: (1) the combn() function completely falls apart for my larger data frames (up to 71 x 155), and (2) it still does not address the two "optimizations" at the same time.
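Just to illustrate point (1): the number of non-empty subsets that combn() would have to enumerate grows as 2^n - 1, which is already hopeless at my data size:

# number of non-empty subsets to enumerate for the 71 x 155 data
sum(choose(71, 1:71))    # row subsets:    2^71  - 1, roughly 2.4e21
sum(choose(155, 1:155))  # column subsets: 2^155 - 1, roughly 4.6e46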
Another potentially related question:
How to optimize intersect of rows and columns in a matrix?
I really hope I was able to adequately describe my aims here. Any suggestions or thoughts would be greatly appreciated. Thank you in advance :)