Pruning a data frame based on row-wise similarity of columns in R

Posted on 2025-01-24 13:41:37

I have a very large dataframe of genomic loci with genotypes scored as 0, 1, or 2. Here is a very small sample that I think gets at the issue:

x1  x2  x3  x4
0   0   1   0
0   0   1   0
1   1   2   1
1   1   1   1
2   2   0   1
2   2   1   2

Loci x1 and x2 are identical while x4 is highly similar. What I am hoping to achieve is to create a function, or use one that already exists, to assign similarity scores, row-wise, for each of my loci and then prune the dataset based on a threshold similarity that I set.

For example, if I set the threshold at 1 (100%), it would prune only x1 and x2 as they are duplicates - which I know how to do. However, if I set the threshold at 0.8, or 80% similarity, it would also prune x4 in addition to x1 and x2.

It's important that the function acts on row-wise similarity and doesn't just check whether columns have similar distributions of 0s, 1s, and 2s.

十雾 2025-01-31 13:41:37

Here's how I would approach this.
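The snippets in this answer assume the sample from the question is already loaded into a data frame called `df`. A minimal way to build it for testing, assuming rows are samples and columns are loci:

```r
# Sample genotype data from the question: rows are samples, columns are loci.
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)

df
```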

First, get a listing of all the unique pairings of your column names:

pairs <- expand.grid(names(df), names(df))
pairs <- pairs[lower.tri(replicate(length(df), names(df))),]

pairs
#>    Var1 Var2
#> 2    x2   x1
#> 3    x3   x1
#> 4    x4   x1
#> 7    x3   x2
#> 8    x4   x2
#> 12   x4   x3
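As an aside, the same set of unique pairs can probably be built more directly with base R's `combn()`. This is only a sketch: `col_names` and `pairs_alt` are illustrative names, and the pairs come out with the two columns in the opposite order from the listing above (x1 appears in `Var1`), which does not affect the comparison.

```r
# Stand-in for names(df); replace with the real column names.
col_names <- c("x1", "x2", "x3", "x4")

# combn() returns a 2 x choose(p, 2) matrix with one pair per column,
# so transpose it to get one pair per row.
pairs_alt <- as.data.frame(t(combn(col_names, 2)), stringsAsFactors = FALSE)
names(pairs_alt) <- c("Var1", "Var2")

pairs_alt
```

This avoids building the full p x p grid first, which may matter when there are many loci.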

Now iterate through this listing and, for each unique pair of columns, compute the proportion of rows in which the two columns hold the same value. This gives you a similarity score between 0 and 1 for each column pair:

pairs$similarity <- apply(pairs, 1, function(x) sum(df[x[1]] == df[x[2]])/nrow(df))

pairs
#>    Var1 Var2 similarity
#> 2    x2   x1  1.0000000
#> 3    x3   x1  0.1666667
#> 4    x4   x1  0.8333333
#> 7    x3   x2  0.1666667
#> 8    x4   x2  0.8333333
#> 12   x4   x3  0.1666667
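For a very large number of loci, calling a function once per pair may be slow. One possible alternative, sketched here under the assumption that every genotype is 0, 1, or 2 with no missing values, is to compute the whole similarity matrix with matrix products. The helper name `similarity_matrix` and its `genotypes` argument are made up for this illustration.

```r
# Sample data from the question (rows = samples, columns = loci).
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)

# Hypothetical helper: fraction of rows on which each pair of columns agrees.
# Assumes all values are among `genotypes` and that there are no NAs.
similarity_matrix <- function(df, genotypes = c(0, 1, 2)) {
  m <- as.matrix(df)
  agree <- matrix(0, ncol(m), ncol(m),
                  dimnames = list(colnames(m), colnames(m)))
  for (g in genotypes) {
    ind <- (m == g) * 1              # 1 where the genotype equals g, else 0
    agree <- agree + crossprod(ind)  # rows where both columns equal g
  }
  agree / nrow(m)
}

sim <- similarity_matrix(df)
round(sim, 3)
```

The result is symmetric with ones on the diagonal, so only the lower triangle needs to be compared against the threshold. Note that the full matrix has one entry per pair of loci, so memory use grows with the square of the number of columns.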

Now remove all the rows of this listing that have a similarity score below your chosen threshold (we'll make it 0.8 here). Using >= rather than > means a threshold of 1 still keeps exact duplicates:

pairs <- pairs[which(pairs$similarity >= 0.8),]

pairs
#>   Var1 Var2 similarity
#> 2   x2   x1  1.0000000
#> 4   x4   x1  0.8333333
#> 8   x4   x2  0.8333333

Now we extract all the unique column names in Var1 and Var2, since these are the columns that are similar to at least one other column:

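# Convert to character first: before R 4.1, c() on factors returns integer codes
keep_cols <- sort(unique(c(as.character(pairs$Var1), as.character(pairs$Var2))))

keep_cols
#> [1] "x1" "x2" "x4"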

And we subset our original data frame using this to get our desired result:

df[match(keep_cols, names(df))]
#>   x1 x2 x4
#> 1  0  0  0
#> 2  0  0  0
#> 3  1  1  1
#> 4  1  1  1
#> 5  2  2  1
#> 6  2  2  2

Of course, you could put all this in a function to make it easier to adjust your threshold and apply iteratively:

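remove_dissimilar <- function(df, threshold = 0.8) {
  # All unique pairs of column names
  pairs <- expand.grid(names(df), names(df))
  pairs <- pairs[lower.tri(replicate(length(df), names(df))),]
  # Proportion of rows on which the two columns of each pair agree
  pairs$similarity <- apply(pairs, 1, function(x) {
    sum(df[x[1]] == df[x[2]])/nrow(df)})
  # Keep pairs at or above the threshold (>= so threshold = 1 keeps exact duplicates)
  pairs <- pairs[which(pairs$similarity >= threshold),]
  # Convert to character first: before R 4.1, c() on factors returns integer codes
  keep_cols <- sort(unique(c(as.character(pairs$Var1), as.character(pairs$Var2))))
  df[match(keep_cols, names(df))]
}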

So now you could just do:

remove_dissimilar(df, 0.8)
#>   x1 x2 x4
#> 1  0  0  0
#> 2  0  0  0
#> 3  1  1  1
#> 4  1  1  1
#> 5  2  2  1
#> 6  2  2  2

remove_dissimilar(df, 0.9)
#>   x1 x2
#> 1  0  0
#> 2  0  0
#> 3  1  1
#> 4  1  1
#> 5  2  2
#> 6  2  2
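The question's wording leaves some room over whether "prune" means keeping or dropping the similar loci; the function above returns them. If the intent is instead to drop every locus that closely matches another one, the complement is easy to take. The sketch below is self-contained and uses a hypothetical helper, `similar_columns`, which treats a pair as similar when its agreement is at or above the threshold.

```r
# Sample data from the question (rows = samples, columns = loci).
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)

# Hypothetical helper: names of columns that agree with at least one
# other column on a fraction of rows >= threshold.
similar_columns <- function(df, threshold = 0.8) {
  m <- as.matrix(df)
  p <- ncol(m)
  if (p < 2) return(character(0))
  flagged <- logical(p)
  for (i in seq_len(p - 1)) {
    for (j in (i + 1):p) {
      if (mean(m[, i] == m[, j]) >= threshold) flagged[c(i, j)] <- TRUE
    }
  }
  colnames(m)[flagged]
}

flagged <- similar_columns(df, 0.8)     # columns similar to some other column
pruned  <- df[setdiff(names(df), flagged)]  # everything else

flagged
names(pruned)
```

With the sample data and a threshold of 0.8 this flags x1, x2, and x4 and leaves only x3. Keeping one representative from each group of similar loci would need an extra rule for choosing which one to keep, which the question does not specify.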