r dataframe multiple-columns data-manipulation

Pruning a data frame based on row-wise column similarity in R

Posted 2025-01-24 13:41:37

I have a very large dataframe of genomic loci with genotypes scored as 0, 1, or 2. Here is a very small sample that I think gets at the issue:

x1  x2  x3  x4
0   0   1   0
0   0   1   0
1   1   2   1
1   1   1   1
2   2   0   1
2   2   1   2

Loci x1 and x2 are identical while x4 is highly similar. What I am hoping to achieve is to create a function, or use one that already exists, to assign similarity scores, row-wise, for each of my loci and then prune the dataset based on a threshold similarity that I set.

For example, if I set the threshold at 1 (100%), it would prune only x1 and x2 as they are duplicates - which I know how to do. However, if I set the threshold at 0.8, or 80% similarity, it would also prune x4 in addition to x1 and x2.

It's important that the function acts on row-wise similarity and doesn't just compare whether the columns have similar distributions of 0's, 1's, and 2's.

十雾 2025-01-31 13:41:37

Here's how I would approach this.

First, get a listing of all the unique pairings of your column names:

pairs <- expand.grid(names(df), names(df))
pairs <- pairs[lower.tri(replicate(length(df), names(df))),]

pairs
#>    Var1 Var2
#> 2    x2   x1
#> 3    x3   x1
#> 4    x4   x1
#> 7    x3   x2
#> 8    x4   x2
#> 12   x4   x3
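
As a side note, the same set of unique pairings can be sketched with `combn()` from the base `utils` package. This is an alternative, not what the answer uses: the `df` below rebuilds the toy data from the question, `pairs_alt` is a hypothetical name, and the two names within each pair come out in the opposite order from the `expand.grid()` listing above.

```r
# Toy data from the question
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)

# combn() returns a 2 x choose(n, 2) matrix; transpose so each row is one pair
pairs_alt <- as.data.frame(t(combn(names(df), 2)), stringsAsFactors = FALSE)
names(pairs_alt) <- c("Var1", "Var2")

nrow(pairs_alt)  # 6 pairs for 4 columns
```

Because the names are stored as character rather than factor, later steps that combine `Var1` and `Var2` do not depend on how `c()` treats factors.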

Now iterate through this to compare the proportion of rows that are identical in each unique pair of columns of your original data set. This gives you a similarity score between 0 to 1 for each column pair:

pairs$similarity <- apply(pairs, 1, function(x) sum(df[x[1]] == df[x[2]])/nrow(df))

pairs
#>    Var1 Var2 similarity
#> 2    x2   x1  1.0000000
#> 3    x3   x1  0.1666667
#> 4    x4   x1  0.8333333
#> 7    x3   x2  0.1666667
#> 8    x4   x2  0.8333333
#> 12   x4   x3  0.1666667
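
Since the question mentions a very large data frame, it may help to compute the same scores as a matrix, comparing one column against all later columns in a single vectorized step. The following is only a sketch under assumptions: `similarity_matrix` is a hypothetical helper, the data are the toy sample from the question, and missing genotypes (`NA`) are not handled.

```r
# Toy data from the question
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)

# Hypothetical helper: proportion of rows in which two columns agree,
# returned as a symmetric matrix with 1 on the diagonal
similarity_matrix <- function(df) {
  m <- as.matrix(df)
  p <- ncol(m)
  sim <- diag(1, p)
  dimnames(sim) <- list(colnames(m), colnames(m))
  for (j in seq_len(p - 1)) {
    others <- (j + 1):p
    # m[, j] is recycled down each of the later columns, so every row is
    # compared with the same row of column j
    agree <- colMeans(m[, others, drop = FALSE] == m[, j])
    sim[j, others] <- agree
    sim[others, j] <- agree
  }
  sim
}

sim <- similarity_matrix(df)
round(sim, 3)
```

The off-diagonal entries should agree with the `similarity` column above, for example `sim["x1", "x4"]` is 5/6. Memory grows with the square of the number of columns, so for many thousands of loci this would likely need to be done in blocks.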

Now keep only the rows of this listing whose similarity score is at or above your chosen threshold (we'll make it 0.8 here). Using >= rather than > means a threshold of 1 still picks up exact duplicates:

pairs <- pairs[which(pairs$similarity >= 0.8),]

pairs
#>   Var1 Var2 similarity
#> 2   x2   x1  1.0000000
#> 4   x4   x1  0.8333333
#> 8   x4   x2  0.8333333

Now we extract all the unique column names in Var1 and Var2, since these are the columns that are similar to at least one other column:

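# as.character() keeps this working on R < 4.1.0, where c() on factors returns integer codes
keep_cols <- names(df)[names(df) %in% c(as.character(pairs$Var1), as.character(pairs$Var2))]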
#> [1] "x1" "x2" "x4"

And we subset our original data frame using this to get our desired result:

df[match(keep_cols, names(df))]
#>   x1 x2 x4
#> 1  0  0  0
#> 2  0  0  0
#> 3  1  1  1
#> 4  1  1  1
#> 5  2  2  1
#> 6  2  2  2

Of course, you could put all this in a function to make it easier to adjust your threshold and apply iteratively:

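remove_dissimilar <- function(df, threshold = 0.8) {
  # All unique pairings of column names
  pairs <- expand.grid(names(df), names(df))
  pairs <- pairs[lower.tri(replicate(length(df), names(df))),]
  # Proportion of rows in which the two columns hold the same value
  pairs$similarity <- apply(pairs, 1, function(x) {
    sum(df[x[1]] == df[x[2]])/nrow(df)})
  # Keep pairs at or above the threshold, so threshold = 1 retains exact duplicates
  pairs <- pairs[which(pairs$similarity >= threshold),]
  # as.character() avoids integer codes from c() on factors in R < 4.1.0
  similar <- c(as.character(pairs$Var1), as.character(pairs$Var2))
  # Subset by name, preserving the original column order
  keep_cols <- names(df)[names(df) %in% similar]
  df[keep_cols]
}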

So now you could just do:

remove_dissimilar(df, 0.8)
#>   x1 x2 x4
#> 1  0  0  0
#> 2  0  0  0
#> 3  1  1  1
#> 4  1  1  1
#> 5  2  2  1
#> 6  2  2  2

remove_dissimilar(df, 0.9)
#>   x1 x2
#> 1  0  0
#> 2  0  0
#> 3  1  1
#> 4  1  1
#> 5  2  2
#> 6  2  2
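
If "prune" is instead meant in the sense of dropping redundant loci (keeping one representative of each group of similar columns), a greedy pass is one way to sketch it. This is an assumption about the intent rather than what the function above does, and `drop_redundant` is a hypothetical helper; the result depends on column order, because earlier columns are kept in preference to later ones.

```r
# Toy data from the question
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)

# Hypothetical helper: walk the columns in order and drop any column that
# matches an already-kept column in at least `threshold` of the rows
drop_redundant <- function(df, threshold = 0.8) {
  kept <- character(0)
  for (col in names(df)) {
    # Row-wise agreement between this column and each column kept so far
    sims <- vapply(kept, function(k) mean(df[[col]] == df[[k]]), numeric(1))
    if (!any(sims >= threshold)) kept <- c(kept, col)
  }
  df[kept]
}

names(drop_redundant(df, 0.8))  # "x1" "x3"
names(drop_redundant(df, 1))    # "x1" "x3" "x4"
```

At a threshold of 1 only the exact duplicate x2 is dropped; at 0.8, x4 is dropped as well because it agrees with x1 in 5 of 6 rows.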
