Compare two user-defined curves and score their similarity

Posted 2024-11-30 14:58:09

I have a set of two curves (each with a few hundred to a couple of thousand data points) that I want to compare to get a similarity "score". Actually, I have more than 100 such sets to compare. I am familiar with R (or at least Bioconductor) and would like to use it.

I tried the ccf() function but I'm not too happy about it.

For example, if I compare c1 to the following curves:

c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c1b <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5) # perfect match! ideally score of 1
c1c <- c(1, 0.2, 0.1, 0.1, 0.5, 0.9, 0.5) # total opposite, ideally score of -1? (what would 0 be though?)
c2 <- c(0, 0.9, 0.9, 0.9, 0, 0.3, 0.3, 0.9) # pretty good, score of ???

Note that the vectors are not all the same length, so they need to be normalized somehow. Any ideas?
If you look at those two lines, they are fairly similar, and I think that as a first step, measuring the area under each curve and subtracting would do. I looked at the post "Shaded area under 2 curves in R", but that is not quite what I need.

A second (optional) issue is that for lines that have the same profile but different amplitudes, I would like to score them as very similar even though the difference in area under them would be large:

c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c4 <- c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3) # very good, score of ??

I hope it is OK for a biologist to pretend to formulate a problem for programmers...

I'd be happy to provide some real life examples if needed.

Thanks in advance!


清风疏影 2024-12-07 14:58:09

They don't form curves in the usual sense of paired x-y values unless they are of equal length. The first three are of equal length, and after packaging them in a matrix, the rcorr function in the Hmisc package returns:

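> library(Hmisc)
> dfrm <- data.frame(c1, c1b, c1c)  # assumed: dfrm is not defined in the original post
> rcorr(as.matrix(dfrm))[[1]]
    c1 c1b c1c
c1   1   1  -1
c1b  1   1  -1
c1c -1  -1   1   # as desired if you scaled them to 0-1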

The correlation of the c1 and c4 vectors:

> cor( c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5),
  c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3) )
[1] 0.9874975
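
For vectors of unequal length such as c1 and c2, one possible workaround is to resample both onto a common grid before correlating them. The following is only a sketch: resample_curve is a hypothetical helper, and it assumes that both curves span the same x-range with evenly spaced points, which may not hold for real data.

```r
# Hypothetical helper (not from the original post): linearly interpolate a
# vector onto n evenly spaced points, treating its x-axis as running 0 to 1.
# Assumption: both curves cover the same x-range with evenly spaced samples.
resample_curve <- function(y, n = 100) {
  x <- seq(0, 1, length.out = length(y))
  approx(x, y, xout = seq(0, 1, length.out = n))$y
}

c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c2 <- c(0, 0.9, 0.9, 0.9, 0, 0.3, 0.3, 0.9)

# Pearson correlation of the two curves after resampling to the same length
score <- cor(resample_curve(c1), resample_curve(c2))
print(score)
```

The score stays in the -1 to 1 range the question asks for, but stretching the shorter vector to match the longer one is a modelling choice, not something the data dictates.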
谁的新欢旧爱 2024-12-07 14:58:09

I do not have a very good answer, but I have faced similar questions in the past, probably on more than one occasion. My approach is to ask myself what makes my curves similar when I evaluate them subjectively (the scientific term here is "eyeballing" :). Is it the area under the curve? Do I count linear translation, rotation, or scaling (zoom) of my curves as contributing to dissimilarity? If not, I remove all the factors that I do not care about with a suitable normalization (e.g. scale the curves to cover the same ranges in x and y).

I am confident that there is a rigorous mathematical theory for this topic; I would search for the words "affinity" and "affine". That said, my primitive/naive methods usually sufficed for the work I was doing.

You may want to ask this question on some math forum.
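
As an illustration of the normalization idea above, the sketch below rescales each curve to the 0-1 range in y and then scores the mean absolute gap between them. Both rescale01 and shape_score are hypothetical names introduced here, the vectors are assumed to be of equal length, and the choice of mean absolute difference is just one of many possible measures.

```r
# Hypothetical helper: rescale a vector so that its values span 0 to 1.
# Assumption: the vector is not constant (otherwise this divides by zero).
rescale01 <- function(y) (y - min(y)) / (max(y) - min(y))

# Hypothetical score: 1 minus the mean absolute gap between the rescaled
# curves, so identical shapes give 1. Assumes equal-length vectors.
shape_score <- function(a, b) 1 - mean(abs(rescale01(a) - rescale01(b)))

c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c4 <- c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3)

print(shape_score(c1, c1))  # identical curves
print(shape_score(c1, c4))  # same profile, lower amplitude
```

Because the amplitude is removed before comparing, c1 and c4 score close to 1 here, which is the behaviour requested in the question's second issue.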

婴鹅 2024-12-07 14:58:09

If the proteins you compare are reasonably close orthologs, you should be able to obtain either an alignment for each pair whose similarity you want to score, or a multiple alignment for the entire bunch. Depending on the application, I think the latter will be more rigorous. I would then extract the folding scores of only those amino acids that are aligned, so that all profiles have the same length, and calculate correlation measures or squared normalized dot products of the profiles as a similarity measure. The squared normalized dot product or the Spearman rank correlation will be less sensitive to amplitude differences, which you seem to want. That will make sure you are comparing elements that are reasonably paired (to the extent the alignment is reasonable), and will let you answer questions like: "Are corresponding residues in the compared proteins generally folded to a similar extent?".
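
The two measures mentioned above could be computed roughly as follows once the profiles have equal length. This is a minimal sketch: sq_norm_dot is a hypothetical helper name, and the question's c1 and c4 vectors stand in for aligned profiles.

```r
# Hypothetical helper: squared normalized dot product (squared cosine
# similarity) of two equal-length profiles. It ranges from 0 to 1 and
# discards the sign, so it cannot report "opposite" profiles as negative.
sq_norm_dot <- function(a, b) sum(a * b)^2 / (sum(a^2) * sum(b^2))

c1 <- c(0, 0.8, 0.9, 0.9, 0.5, 0.1, 0.5)
c4 <- c(0, 0.6, 0.7, 0.7, 0.3, 0.1, 0.3)

print(sq_norm_dot(c1, c4))
# Spearman rank correlation depends only on the ordering of the values
print(cor(c1, c4, method = "spearman"))
```

Since c1 and c4 rank their points in the same order, the Spearman correlation for this pair is 1 even though the amplitudes differ.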
