Finding the optimal subset of rows and the optimal subset of columns based on marginal means
For a longitudinal dataset, I have a binary matrix (df) indicating whether data is available for cases (rows) across time points (columns). I would like to find the optimal subset in which at least 2/3 of each row and each column == 1. The issue I run into is that the two criteria depend on one another (i.e., the set of columns for which at least 2/3 of the rows have data available changes as soon as I remove a row that has less than 2/3 of its data available, and vice versa).
# data structure example:
library(data.table)
set.seed(42)
df <- as.data.table(matrix(rbinom(10*5, 1, .66), ncol = 10, nrow = 5))
df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 0 1 1 0 0 1 0 0 1 0
2: 0 0 0 0 1 1 0 1 1 0
3: 1 1 0 1 0 0 1 1 1 1
4: 0 1 1 1 0 1 0 0 0 0
5: 1 0 1 1 1 0 1 1 1 1
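To make the 2/3 criterion concrete, this is the kind of check I have in mind for any candidate subset (a small helper written purely for illustration, not part of my actual code):

# purely illustrative helper: does a candidate subset of rows and columns
# satisfy the 2/3 rule in both directions?
feasible <- function(dt, rows, cols) {
  sub <- as.matrix(dt)[rows, cols, drop = FALSE]
  all(rowMeans(sub) >= 2/3) && all(colMeans(sub) >= 2/3)
}
feasible(df, rows = c(3, 5), cols = c(4, 7, 8, 9))  # example call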
From the tags of similar questions, this seems like an integer programming problem or a multi-objective optimization problem. Unfortunately, I am not super familiar with either of those approaches. Intuitively, I would like to simultaneously maximize rowMeans and colMeans under a >= .66 constraint, but I'm not sure whether that is really the most productive approach here.
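From what I have read, the problem might be expressible as a mixed-integer program with one binary "keep" variable per row and per column. Below is my rough, untested sketch of such a formulation using the ompr modelling package with the GLPK solver; the package choice, the big-M constraints, and the objective (maximize the number of available observations in the retained sub-matrix) are all my own assumptions:

library(magrittr)           # for %>%
library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

A <- as.matrix(df)
n <- nrow(A)
m <- ncol(A)

model <- MIPModel() %>%
  # keep_r[i] = 1 if row i is kept, keep_c[j] = 1 if column j is kept
  add_variable(keep_r[i], i = 1:n, type = "binary") %>%
  add_variable(keep_c[j], j = 1:m, type = "binary") %>%
  # z[i, j] = 1 only if both row i and column j are kept (linearised product)
  add_variable(z[i, j], i = 1:n, j = 1:m, type = "binary") %>%
  add_constraint(z[i, j] <= keep_r[i], i = 1:n, j = 1:m) %>%
  add_constraint(z[i, j] <= keep_c[j], i = 1:n, j = 1:m) %>%
  # every kept row needs >= 2/3 ones within the kept columns (big-M switch)
  add_constraint(sum_over(A[i, j] * keep_c[j], j = 1:m) >=
                   (2/3) * sum_over(keep_c[j], j = 1:m) - m * (1 - keep_r[i]),
                 i = 1:n) %>%
  # every kept column needs >= 2/3 ones within the kept rows
  add_constraint(sum_over(A[i, j] * keep_r[i], i = 1:n) >=
                   (2/3) * sum_over(keep_r[i], i = 1:n) - n * (1 - keep_c[j]),
                 j = 1:m) %>%
  # maximise the amount of available (== 1) data in the retained sub-matrix
  set_objective(sum_over(A[i, j] * z[i, j], i = 1:n, j = 1:m), sense = "max")

result <- solve_model(model, with_ROI(solver = "glpk"))
sol_r  <- get_solution(result, keep_r[i])
sol_c  <- get_solution(result, keep_c[j])
keep_rows <- sol_r$i[sol_r$value > 0.5]
keep_cols <- sol_c$j[sol_c$value > 0.5]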
So far, I have tried to adapt the approaches of similar questions. With a brute-force approach based on this thread, I was able to find optimal row and column subsets separately:
best.list.row.df = list()
for (i in 1:nrow(df)) {
  # get best subset for rows based on how many columns have more than 66% data
  rowlist = combn(nrow(df), i)
  numobs = apply(rowlist, 2, function(x) sum(colMeans(df[x,])*100 >= 66))
  cat("For subsets of", i, "rows, the highest number of observations is",
      max(numobs), "out of the", ncol(df), "maximum. Product =", i*max(numobs), "\n")
  best = which(numobs == max(numobs))[1]
  best.list.row.df = c(best.list.row.df, list(rowlist[, best]))
}
> For subsets of 1 rows, the highest number of observations is 8 out of the 10 maximum. Product = 8
> For subsets of 2 rows, the highest number of observations is 6 out of the 10 maximum. Product = 12
> For subsets of 3 rows, the highest number of observations is 8 out of the 10 maximum. Product = 24
> For subsets of 4 rows, the highest number of observations is 4 out of the 10 maximum. Product = 16
> For subsets of 5 rows, the highest number of observations is 1 out of the 10 maximum. Product = 5
best.list.col.df = list()
for (i in 1:ncol(df)) {
  # get best subset for columns based on how many rows have more than 66% data
  collist = combn(ncol(df), i)
  numobs = apply(collist, 2, function(x) sum(rowMeans(df[, ..x])*100 >= 66))
  cat("For subsets of", i, "columns, the highest number of participants is",
      max(numobs), "out of the", nrow(df), "maximum. Product =", i*max(numobs), "\n")
  best = which(numobs == max(numobs))[1]
  best.list.col.df = c(best.list.col.df, list(collist[, best]))
}
> For subsets of 1 columns, the highest number of participants is 4 out of the 5 maximum. Product = 4
> For subsets of 2 columns, the highest number of participants is 3 out of the 5 maximum. Product = 6
> For subsets of 3 columns, the highest number of participants is 5 out of the 5 maximum. Product = 15
> For subsets of 4 columns, the highest number of participants is 4 out of the 5 maximum. Product = 16
> For subsets of 5 columns, the highest number of participants is 2 out of the 5 maximum. Product = 10
> For subsets of 6 columns, the highest number of participants is 4 out of the 5 maximum. Product = 24
> For subsets of 7 columns, the highest number of participants is 2 out of the 5 maximum. Product = 14
> For subsets of 8 columns, the highest number of participants is 2 out of the 5 maximum. Product = 16
> For subsets of 9 columns, the highest number of participants is 2 out of the 5 maximum. Product = 18
> For subsets of 10 columns, the highest number of participants is 2 out of the 5 maximum. Product = 20
Based on these results, I would select the three-row and the six-column solutions, as those individually give me the most valid data (the highest product of subset size and rows/columns meeting the criterion).
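To make that selection concrete, the two winning index sets can be pulled out of the lists built above and cross-checked against each other (purely as an illustration; it already hints at the interdependence described below):

rows <- best.list.row.df[[3]]   # indices of the best 3-row subset
cols <- best.list.col.df[[6]]   # indices of the best 6-column subset
sub  <- df[rows, ..cols]
# combining the two separately optimised subsets does not guarantee that
# either criterion still holds
all(rowMeans(sub) >= 2/3)
all(colMeans(sub) >= 2/3)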
The problems with this approach are: (1) the combn() function completely falls apart for my larger data frames (up to 71 x 155), and (2) it still does not address the two "optimizations" at the same time.
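Just to illustrate point (1): the number of non-empty subsets that combn() would have to enumerate grows as 2^n - 1, which is already hopeless at my data size:

# number of non-empty subsets to enumerate for the 71 x 155 data
sum(choose(71, 1:71))    # row subsets:    2^71  - 1, roughly 2.4e21
sum(choose(155, 1:155))  # column subsets: 2^155 - 1, roughly 4.6e46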
Another potentially related question:
How to optimize intersect of rows and columns in a matrix?
I really hope I was able to adequately describe my aims here. Any suggestions or thoughts would be greatly appreciated. Thank you in advance :)