Comparing two user-defined curves and scoring their similarity

Posted 2024-11-30 14:58:09

I have a set of 2 curves (each with a few hundred to a couple of thousand data points) that I want to compare to get some similarity "score". Actually, I have more than 100 such sets to compare. I am familiar with R (or at least Bioconductor) and would like to use it.

I tried the ccf() function, but I'm not too happy with it.

For example, if I compare c1 to the following curves:

c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)

c1b <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5) # perfect match! ideally score of 1

c1c <- c(1, 0.2, 0.1, 0.1, 0.5, 0.9, 0.5) # total opposite, ideally score of -1? (what would 0 be though?)

c2 <- c(0, 0.9, 0.9, 0.9, 0, 0.3, 0.3, 0.9) #pretty good, score of ???

Note that the vectors don't all have the same length, so they need to be normalized somehow. Any ideas?
If you look at those 2 lines, they are fairly similar, and I think that as a first step, measuring the area under each of the 2 curves and subtracting would do. I looked at the post "Shaded area under 2 curves in R", but that is not quite what I need.

A second issue (optional) is that for lines that have the same profile but different amplitudes, I would like to score those as very similar even though the difference between the areas under them would be large:

c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)

c4 <- c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3) # very good, score of ??

I hope that a biologist trying to formulate a problem for programmers is OK...

I'd be happy to provide some real life examples if needed.

Thanks in advance!

清风疏影 2024-12-07 14:58:09

They don't form curves in the usual sense of paired x, y values unless they are of equal length. The first three are of equal length, and after packaging them in a matrix, the rcorr function in the Hmisc package returns:

> rcorr(as.matrix(dfrm))[[1]]
    c1 c1b c1c
c1   1   1  -1
c1b  1   1  -1
c1c -1  -1   1   # as desired if you scaled them to 0-1
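
The matrix above can be reproduced along the following lines. This is a minimal sketch that assumes `dfrm` is simply a data frame holding the three equal-length vectors from the question (the answer does not show how it was built), and it uses base R's `cor()` instead of `rcorr()` so that no extra package is needed:

```r
# Vectors from the question (all of length 7)
c1  <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c1b <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c1c <- c(1, 0.2, 0.1, 0.1, 0.5, 0.9, 0.5)

# Assumption: dfrm is a data frame with one column per curve
dfrm <- data.frame(c1, c1b, c1c)

# Pairwise Pearson correlations between the columns
r <- cor(dfrm)
print(round(r, 3))
```

With Hmisc installed, `Hmisc::rcorr(as.matrix(dfrm))$r` should give the same coefficients.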

The correlation of the c1 and c4 vectors:

> cor( c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5),
  c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3) )
[1] 0.9874975
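
None of this handles `c2`, which has 8 points instead of 7. One possible way around that, sketched here under the assumption that every curve spans the same x range and is sampled evenly along it, is to interpolate each curve onto a common grid with `approx()` before correlating. The helper name `resample_curve` and the grid size of 100 are illustrative choices, not part of the answer:

```r
c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c2 <- c(0, 0.9, 0.9, 0.9, 0, 0.3, 0.3, 0.9)

# Hypothetical helper: linearly interpolate a curve onto n evenly spaced
# points, assuming its samples are evenly spaced over [0, 1].
# rule = 2 guards against rounding problems at the two endpoints.
resample_curve <- function(y, n = 100) {
  x <- seq(0, 1, length.out = length(y))
  approx(x, y, xout = seq(0, 1, length.out = n), rule = 2)$y
}

# Pearson correlation of the two curves after resampling
score <- cor(resample_curve(c1), resample_curve(c2))
print(score)
```

If the real curves have their own x coordinates, those should replace the evenly spaced `x` assumed here.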

谁的新欢旧爱 2024-12-07 14:58:09

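I do not have a very good answer, but I have faced a similar question in the past, probably on more than one occasion. My approach is to ask myself what makes my curves similar when I evaluate them subjectively (the scientific term here is "eyeballing" :). Is it the area under the curve? Do I count linear translation, rotation, or scaling (zoom) of my curves as contributing to the difference? If not, I take out all the factors that I do not care about with a suitable normalization (e.g. scaling the curves to cover the same ranges in x and y).

I am confident that there is a rigorous mathematical theory for this topic; I would search for the words "affinity" and "affine". That said, my primitive/naive methods usually sufficed for the work I was doing.

You may want to ask this question on a math forum.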
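
As one concrete reading of "take out the factors you do not care about", the sketch below rescales each curve to the range [0, 1] in y, maps both onto a shared x grid on [0, 1], and then uses the mean absolute gap between them as a crude stand-in for the area between the curves. The function names, the grid size, and the choice of 1 minus the mean gap as the score are assumptions made for illustration, not something this answer prescribes:

```r
# Rescale values to [0, 1]; a flat curve is mapped to all zeros
rescale01 <- function(y) {
  rng <- range(y)
  if (rng[1] == rng[2]) return(rep(0, length(y)))
  (y - rng[1]) / (rng[2] - rng[1])
}

# Interpolate onto n evenly spaced x positions in [0, 1],
# assuming the samples are evenly spaced along x
to_grid <- function(y, n = 200) {
  x <- seq(0, 1, length.out = length(y))
  approx(x, y, xout = seq(0, 1, length.out = n), rule = 2)$y
}

# 1 means identical shapes after normalization; lower means further apart
shape_score <- function(a, b, n = 200) {
  ga <- to_grid(rescale01(a), n)
  gb <- to_grid(rescale01(b), n)
  1 - mean(abs(ga - gb))
}

c1  <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c1c <- c(1, 0.2, 0.1, 0.1, 0.5, 0.9, 0.5)
c4  <- c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3)

print(shape_score(c1, c4))   # same profile, different amplitude
print(shape_score(c1, c1c))  # mirrored profile
```

Unlike a correlation, this score never goes negative, so a mirrored curve simply scores low rather than -1.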

婴鹅 2024-12-07 14:58:09

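If the proteins you are comparing are reasonably close orthologs, you should be able to obtain either an alignment for each pair whose similarity you want to score, or a multiple alignment for the whole group of proteins. Depending on the application, I think the latter will be more rigorous. I would then extract the folding scores of only those amino acids that are aligned, so that all profiles have the same length, and calculate a correlation measure or the squared normalized dot product of the profiles as a measure of similarity. The squared normalized dot product or the Spearman rank correlation will be less sensitive to amplitude differences, which seems to be what you want. This will make sure you are comparing elements that are reasonably paired (to the extent that the alignment is reasonable), and will let you answer questions like "Are corresponding residues in the compared proteins generally folded to a similar extent?".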
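
For profiles that have already been brought to equal length, the two measures mentioned above might be computed as follows. This is only a sketch: `sq_norm_dot` is a hypothetical helper name, and the example vectors are the ones from the question rather than real aligned folding profiles:

```r
# Squared normalized dot product (squared cosine similarity), in [0, 1]
sq_norm_dot <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a * b)^2 / (sum(a^2) * sum(b^2))
}

c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c4 <- c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3)

dot_score  <- sq_norm_dot(c1, c4)
# Spearman rank correlation; ties get average ranks
rank_score <- cor(c1, c4, method = "spearman")

print(c(dot = dot_score, spearman = rank_score))
```

Because the dot product is squared, it cannot tell a profile from its negation, so the rank correlation is the one to use if the sign of the relationship matters.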
