Finding the best subset of rows and the best subset of columns based on marginal means

Posted on 2025-02-07 06:58:14

For a longitudinal dataset, I have a binary matrix (df) indicating whether data is available for cases (rows) across time points (columns). I would like to find the optimal subset where at least 2/3 of each row and column == 1. The issue I run into is that they depend on one another (i.e., the columns for which at least 2/3 of the rows have data available change as soon as I remove a row that has less than 2/3 data available, and vice versa).

# data structure example:
library(data.table)  # for as.data.table() and the df[, ..x] subsetting used below
set.seed(42)
df <- as.data.table(matrix(rbinom(10 * 5, 1, .66), ncol = 10, nrow = 5))

df
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1:  0  1  1  0  0  1  0  0  1   0
2:  0  0  0  0  1  1  0  1  1   0
3:  1  1  0  1  0  0  1  1  1   1
4:  0  1  1  1  0  1  0  0  0   0
5:  1  0  1  1  1  0  1  1  1   1
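Before any subsetting, it may help to see how far this example is from satisfying the constraint on its own: in the full matrix only rows 3 and 5 and column V9 reach the 2/3 threshold, so rows and columns clearly have to be dropped jointly. A quick check (using a plain matrix so it runs without data.table):

```r
# marginal means of the full example matrix
set.seed(42)
m <- matrix(rbinom(10 * 5, 1, .66), ncol = 10, nrow = 5)
round(rowMeans(m), 2)  # 0.4 0.4 0.7 0.4 0.8 -> only rows 3 and 5 reach 2/3
round(colMeans(m), 2)  # only column 9 (mean 0.8) reaches 2/3
```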

From the tags of similar questions, this seems like an integer programming problem or multi-objective optimization issue. Unfortunately, I am not super familiar with either of those approaches. Intuitively, I would like to simultaneously maximize rowMeans and colMeans with a >= .66 constraint — but I'm not sure whether that is really the most productive approach here.
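One way to make the "simultaneous" version precise is as a 0-1 integer program. The sketch below is my own formulation, not taken from a specific reference: `a_ij` is the binary data matrix, `r_i` marks a kept row, `c_j` a kept column, and `y_ij` their product (linearized by the three inequalities). A kept row must average at least 2/3 over the kept columns, and symmetrically for columns; `M_r` and `M_c` are big-M constants that switch the constraint off for dropped rows/columns (`M_r = (2/3) * ncol` and `M_c = (2/3) * nrow` are sufficient).

```latex
\max \sum_{i,j} y_{ij} \qquad \text{(cells retained)}

\text{s.t.}\quad
y_{ij} \le r_i, \qquad y_{ij} \le c_j, \qquad y_{ij} \ge r_i + c_j - 1

\sum_j \bigl(a_{ij} - \tfrac{2}{3}\bigr)\, c_j \ge -M_r \,(1 - r_i) \qquad \forall i

\sum_i \bigl(a_{ij} - \tfrac{2}{3}\bigr)\, r_i \ge -M_c \,(1 - c_j) \qquad \forall j

r_i,\; c_j,\; y_{ij} \in \{0, 1\}
```

The objective could alternatively be the amount of valid data kept, `\sum a_{ij} y_{ij}`. For a 71 x 155 matrix this gives roughly 11,000 `y` variables, which is presumably still a manageable size for an off-the-shelf MIP solver (e.g. GLPK, reachable from R), though I have not benchmarked it.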

So far, I have tried to adapt the approaches of similar questions. With a brute-force approach based on this thread, I was able to find optimal row and column subsets separately:

best.list.row.df = list()
for (i in 1:nrow(df)) {
  # get best subset for rows based on how many columns have at least 66% data
  rowlist = combn(nrow(df), i)
  numobs = apply(rowlist, 2, function(x) sum(colMeans(df[x,])*100 >= 66))
  cat("For subsets of", i, "rows, the highest number of observations is", max(numobs), "out of the", ncol(df), "maximum. Product =", i*max(numobs),"\n")
  best = which(numobs == max(numobs))[1]
  best.list.row.df = c(best.list.row.df, list(rowlist[, best]))
}
> For subsets of 1 rows, the highest number of observations is 8 out of the 10 maximum. Product = 8 
> For subsets of 2 rows, the highest number of observations is 6 out of the 10 maximum. Product = 12 
> For subsets of 3 rows, the highest number of observations is 8 out of the 10 maximum. Product = 24 
> For subsets of 4 rows, the highest number of observations is 4 out of the 10 maximum. Product = 16 
> For subsets of 5 rows, the highest number of observations is 1 out of the 10 maximum. Product = 5 

best.list.col.df = list()
for (i in 1:ncol(df)) {
  # get best subset for columns based on how many rows have at least 66% data
  collist = combn(ncol(df), i)
  numobs = apply(collist, 2, function(x) sum(rowMeans(df[, ..x])*100 >= 66))
  cat("For subsets of", i, "columns, the highest number of participants is", max(numobs), "out of the", nrow(df), "maximum. Product =", i*max(numobs),"\n")
  best = which(numobs == max(numobs))[1]
  best.list.col.df = c(best.list.col.df, list(collist[, best]))
}
> For subsets of 1 columns, the highest number of participants is 4 out of the 5 maximum. Product = 4 
> For subsets of 2 columns, the highest number of participants is 3 out of the 5 maximum. Product = 6 
> For subsets of 3 columns, the highest number of participants is 5 out of the 5 maximum. Product = 15 
> For subsets of 4 columns, the highest number of participants is 4 out of the 5 maximum. Product = 16 
> For subsets of 5 columns, the highest number of participants is 2 out of the 5 maximum. Product = 10 
> For subsets of 6 columns, the highest number of participants is 4 out of the 5 maximum. Product = 24 
> For subsets of 7 columns, the highest number of participants is 2 out of the 5 maximum. Product = 14 
> For subsets of 8 columns, the highest number of participants is 2 out of the 5 maximum. Product = 16 
> For subsets of 9 columns, the highest number of participants is 2 out of the 5 maximum. Product = 18 
> For subsets of 10 columns, the highest number of participants is 2 out of the 5 maximum. Product = 20 

Based on these results, I would select the solution of three rows and six columns, since each of those subsets individually gives me the most valid data (Product = 24 in both cases).

The problems with this approach are: (1) the combn() enumeration becomes computationally infeasible for my larger data frames (up to 71 x 155), and (2) it still does not address the two "optimizations" at the same time.
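One scalable alternative that handles both margins at once, though without any optimality guarantee, would be a greedy heuristic: repeatedly drop the single row or column with the worst marginal mean until every remaining row and column mean reaches 2/3. The `prune_matrix` helper below is a hypothetical sketch of that idea in base R, not a definitive solution:

```r
# Greedy alternating deletion (heuristic sketch; not guaranteed optimal):
# repeatedly drop the row or column with the worst marginal mean until
# every remaining row and column mean reaches the threshold.
prune_matrix <- function(m, threshold = 2/3) {
  rows <- seq_len(nrow(m))
  cols <- seq_len(ncol(m))
  while (length(rows) > 0 && length(cols) > 0) {
    sub <- m[rows, cols, drop = FALSE]
    rmeans <- rowMeans(sub)
    cmeans <- colMeans(sub)
    if (all(rmeans >= threshold) && all(cmeans >= threshold)) break
    # drop from whichever margin currently has the worse minimum
    if (min(rmeans) <= min(cmeans)) {
      rows <- rows[-which.min(rmeans)]
    } else {
      cols <- cols[-which.min(cmeans)]
    }
  }
  list(rows = rows, cols = cols)
}

# demo on the example data (plain matrix instead of a data.table)
set.seed(42)
m <- matrix(rbinom(10 * 5, 1, .66), ncol = 10, nrow = 5)
res <- prune_matrix(m)
res$rows  # retained cases
res$cols  # retained time points
```

Each iteration removes one row or column, so the loop always terminates; on the 71 x 155 matrix that is at most 226 passes over the data. Because greedy deletion can discard too much, the result is a feasible subset but not necessarily the largest one.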

Other potentially related question:
How to optimize intersect of rows and columns in a matrix?

I really hope I was able to adequately describe my aims here. Any suggestions or thoughts would be greatly appreciated. Thank you in advance already :)
