Prune a data frame based on row-wise similarity of columns in R
I have a very large dataframe of genomic loci with genotypes scored as 0, 1, or 2. Here is a very small sample that I think gets at the issue:
x1 x2 x3 x4
0 0 1 0
0 0 1 0
1 1 2 1
1 1 1 1
2 2 0 1
2 2 1 2
Loci x1 and x2 are identical while x4 is highly similar. What I am hoping to achieve is to create a function, or use one that already exists, to assign similarity scores, row-wise, for each of my loci and then prune the dataset based on a threshold similarity that I set.
For example, if I set the threshold at 1 (100%), it would prune only x1 and x2 as they are duplicates - which I know how to do. However, if I set the threshold at 0.8, or 80% similarity, it would also prune x4 in addition to x1 and x2.
It's important that the function acts on row-wise similarity and doesn't just check whether columns have similar distributions of 0s, 1s, and 2s.
Comments (1)
Here's how I would approach this.
First, get a listing of all the unique pairings of your column names:
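A minimal sketch of this step, assuming the sample data from the question is stored in a data frame called `df` (a name chosen here for illustration) and that the pairings are built with `combn`. The columns are named Var1 and Var2 to match the names used further down:

```r
# Sample data from the question; the name `df` is an assumption.
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)

# combn() returns a 2-row matrix with one column per unordered pair of names.
# Transposing gives one row per pair.
pairs <- as.data.frame(t(combn(names(df), 2)), stringsAsFactors = FALSE)
names(pairs) <- c("Var1", "Var2")
pairs
```

With four columns this gives six pairings. In general there are n(n-1)/2 of them, so the listing grows quadratically with the number of loci.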
Now iterate through this to compare the proportion of rows that are identical in each unique pair of columns of your original data set. This gives you a similarity score between 0 and 1 for each column pair:
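One way this comparison might be written, as a sketch under the same assumptions (`df` holds the sample data, and `pairs` has the columns Var1 and Var2). The `similarity` column name is illustrative:

```r
# Sample data from the question; the name `df` is an assumption.
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)
pairs <- as.data.frame(t(combn(names(df), 2)), stringsAsFactors = FALSE)
names(pairs) <- c("Var1", "Var2")

# For each pair, the share of rows in which the two columns hold the same value.
# mean() of a logical vector is the proportion of TRUE entries.
pairs$similarity <- mapply(
  function(a, b) mean(df[[a]] == df[[b]]),
  pairs$Var1, pairs$Var2,
  USE.NAMES = FALSE
)
pairs
```

Because the comparison is done element by element within each row, two columns with the same overall mix of 0s, 1s, and 2s but in different rows still score low.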
Now remove all the rows of this listing that have a similarity score below your chosen threshold (we'll make it 0.8 here).
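A sketch of this filter, again assuming the `df` and `pairs` objects described above. The `threshold` variable is a name introduced here for illustration:

```r
# Sample data from the question; the name `df` is an assumption.
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)
pairs <- as.data.frame(t(combn(names(df), 2)), stringsAsFactors = FALSE)
names(pairs) <- c("Var1", "Var2")
pairs$similarity <- mapply(
  function(a, b) mean(df[[a]] == df[[b]]),
  pairs$Var1, pairs$Var2,
  USE.NAMES = FALSE
)

# Keep only the pairs whose similarity reaches the threshold.
threshold <- 0.8
pairs <- pairs[pairs$similarity >= threshold, ]
pairs
```

Using `>=` rather than `>` means a threshold of 1 still keeps exact duplicates such as x1 and x2.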
Now we extract all the unique column names in Var1 and Var2, since these are the columns that are similar to at least one other column. We then subset our original data frame using this to get our desired result:
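A sketch of the extraction and subsetting, under the same assumptions as before. The text does not spell out whether the similar columns should be kept or dropped, so both subsets are shown. `similar_cols` and `remaining` are illustrative names:

```r
# Sample data from the question; the name `df` is an assumption.
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)
pairs <- as.data.frame(t(combn(names(df), 2)), stringsAsFactors = FALSE)
names(pairs) <- c("Var1", "Var2")
pairs$similarity <- mapply(
  function(a, b) mean(df[[a]] == df[[b]]),
  pairs$Var1, pairs$Var2,
  USE.NAMES = FALSE
)
pairs <- pairs[pairs$similarity >= 0.8, ]

# Every column that appears in at least one surviving pair.
similar_cols <- sort(unique(c(pairs$Var1, pairs$Var2)))

# The similar columns themselves ...
df[similar_cols]

# ... and, if pruning means dropping them, the columns that are left.
remaining <- df[setdiff(names(df), similar_cols)]
remaining
```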
Of course, you could put all this in a function to make it easier to adjust your threshold and apply iteratively:
So now you could just do:
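The wrapper and its call might look like the sketch below. The function name `similar_columns` and its arguments are hypothetical, not taken from the original answer. It returns the columns that match at least one other column at or above the threshold:

```r
# Hypothetical helper; the name and arguments are illustrative only.
similar_columns <- function(data, threshold = 0.8) {
  # All unordered pairs of column names.
  pairs <- as.data.frame(t(combn(names(data), 2)), stringsAsFactors = FALSE)
  names(pairs) <- c("Var1", "Var2")
  # Proportion of rows in which the two columns agree.
  pairs$similarity <- mapply(
    function(a, b) mean(data[[a]] == data[[b]]),
    pairs$Var1, pairs$Var2,
    USE.NAMES = FALSE
  )
  pairs <- pairs[pairs$similarity >= threshold, ]
  data[sort(unique(c(pairs$Var1, pairs$Var2)))]
}

# Sample data from the question; the name `df` is an assumption.
df <- data.frame(
  x1 = c(0, 0, 1, 1, 2, 2),
  x2 = c(0, 0, 1, 1, 2, 2),
  x3 = c(1, 1, 2, 1, 0, 1),
  x4 = c(0, 0, 1, 1, 1, 2)
)

at_100 <- similar_columns(df, threshold = 1)    # exact duplicates only
at_80  <- similar_columns(df, threshold = 0.8)  # also picks up x4
names(at_100)
names(at_80)
```

For a very large set of loci the number of pairs grows quadratically, so this pairwise approach may need to be run in chunks or replaced with something more memory-efficient.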