Math library for comparing similarities in data graphs, for a high-level language (e.g. Javascript)?
I'm looking for something that I guess is rather sophisticated and might not exist publicly, but hopefully it does.
I basically have a database with lots of items which all have values (y) that correspond to other values (x). For example, one of these items might look like:
x | 1 | 2 | 3 | 4 | 5
y | 12 | 14 | 16 | 8 | 6
This is just a random example. Now, there are thousands of these items, all with their own set of x and y values. The range between one x and the x after it is not fixed and may differ for every item.
What I'm looking for is a library where I can plug in all these sets of Xs and Ys and tell it to return things like the most common item (sets of x and y that follow a comparable curve / progression), and which can check whether a certain set is at least x% comparable with another set.
By comparable I mean the slope of the curve if you were to draw a graph of the data. So, not actually the static values but rather the detection of events, such as a steep increase followed by a slow decrease, etc.
Because I have little experience in mathematics, I'm not quite sure what the thing I'm looking for is called, and thus have trouble explaining what I need. Hopefully I gave enough pointers for someone to point me in the right direction.
I'm mostly interested in a library for Javascript, but if there is no such thing, any library would help; maybe I can try to port what I need.
About Markov Cluster(ing) again, of which I happen to be the author, and your application. You mention you are interested in trend similarity between objects. This is typically computed using Pearson correlation. If you use the mcl implementation from http://micans.org/mcl/, you'll also obtain the program 'mcxarray'. This can be used to compute Pearson correlations between e.g. rows in a table. It might be useful to you. It is able to handle missing data: in a simplistic approach, it just computes correlations on those indices for which values are available for both rows. If you have further questions I am happy to answer them, with the caveat that I usually like to cc replies to the mcl mailing list so that they are archived and available for future reference.
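To make the idea concrete, here is a minimal Javascript sketch of a Pearson correlation that skips missing values in the way described above. It is not mcxarray's code; the function name `pearson` and the convention that null, undefined or NaN marks a missing value are assumptions made for illustration.

```javascript
// Pearson correlation between two equally long series.
// Assumption: null, undefined or NaN marks a missing value. Only indices
// where both series have a value are used.
function pearson(a, b) {
  if (a.length !== b.length) {
    throw new Error("series must have the same length");
  }
  const xs = [];
  const ys = [];
  for (let i = 0; i < a.length; i++) {
    if (Number.isFinite(a[i]) && Number.isFinite(b[i])) {
      xs.push(a[i]);
      ys.push(b[i]);
    }
  }
  const n = xs.length;
  if (n < 2) return NaN; // not enough shared points
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  // A constant series has no defined correlation.
  if (varX === 0 || varY === 0) return NaN;
  return cov / Math.sqrt(varX * varY);
}

// Two items with the same shape at different scales correlate perfectly.
const r = pearson([12, 14, 16, 8, 6], [24, 28, 32, 16, 12]);
```

A value close to 1 means the two series rise and fall together, a value close to -1 means they move in opposite directions, so a "at least x% comparable" check could be expressed as a threshold on this number.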
What you're looking for is an implementation of Markov clustering. It is often used for finding groups of similar sequences. Porting it to Javascript, well... If you're really serious about this analysis, you should drop Javascript as soon as possible and move on to R. Javascript is not meant for this kind of calculation, and it is far too slow for it. R is a statistical package with a great deal already implemented. It is also designed specifically for very fast matrix calculations, and most of the language is vectorized (meaning you don't need for-loops to apply a function over a vector of values; it happens automatically).
For the Markov clustering, check http://www.micans.org/mcl/
An example of an implementation: http://www.orthomcl.org/cgi-bin/OrthoMclWeb.cgi
Now you also need to define a "distance" between your sets. As you are interested in the events and not the values, you could give every item an extra attribute: a vector with the differences y[i] - y[i-1] (in R: diff(y)). The distance between two items can then be calculated as the sum of squared differences between the corresponding entries of their two difference vectors.
This allows you to construct a distance matrix of your items, and on that matrix you can call the mcl algorithm. Unless you work on Linux, you'll have to port that one.
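For reference, the distance described above can be sketched in a few lines of Javascript. The names `diff`, `shapeDistance` and `distanceMatrix` are made up for this illustration, and the sketch assumes every item has the same number of points at comparable x positions, which the question says is not guaranteed, so real data would first need resampling onto a common grid.

```javascript
// Successive differences y[i] - y[i-1], like diff(y) in R.
function diff(y) {
  const d = [];
  for (let i = 1; i < y.length; i++) {
    d.push(y[i] - y[i - 1]);
  }
  return d;
}

// Sum of squared differences between the two difference vectors.
// Assumption: both series have the same number of points.
function shapeDistance(y1, y2) {
  if (y1.length !== y2.length) {
    throw new Error("series must have the same length");
  }
  const d1 = diff(y1);
  const d2 = diff(y2);
  let sum = 0;
  for (let i = 0; i < d1.length; i++) {
    const delta = d1[i] - d2[i];
    sum += delta * delta;
  }
  return sum;
}

// Symmetric matrix of pairwise distances between all items.
function distanceMatrix(items) {
  return items.map((a) => items.map((b) => shapeDistance(a, b)));
}

const items = [
  [12, 14, 16, 8, 6],
  [2, 4, 6, -2, -4], // same shape as the first item, shifted down by 10
  [1, 2, 3, 4, 5],
];
const D = distanceMatrix(items);
```

Because only differences are compared, the first two items end up at distance 0 even though their absolute values differ, which matches the "events, not static values" requirement.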
What you're wanting to do is ANOVA, or ANalysis Of VAriance. If you run the numbers through an ANOVA test, it'll give you information about the dataset that will help you compare one to another. I was unable to locate a Javascript library that would perform ANOVA, but there are plenty of programs that are capable of it. Excel can perform ANOVA from a plugin. R is a stats package that is free and can also perform ANOVA.
Hope this helps.
Something simple is (assuming all the graphs have 5 points, and x = 1,2,3,4,5 always)
Now consider the vector u as a point in 5-dimensional space. You can use simple clustering algorithms, like k-means.
EDIT: You should not aim for something too complicated as long as you go with javascript. If you want to go with Java, I can suggest something based on PCA (requiring the use of singular value decomposition, which is too complicated to be implemented efficiently in JS).
Basically, it goes like this: as before, take a (possibly large) linear representation of the data, for example differences of the components of x and of y, or absolute values. For instance you could take
u = (x1, x2 - x1, ..., x5 - x4, y1, y2 - y1, ..., y5 - y4)
You compute the vector u for each sample. Call ui the vector u for the ith sample. Now, form the matrix
M_{ij} = dot product of ui and uj
and compute its SVD. Now, the N most significant singular values (i.e. those above some "similarity threshold") give you N clusters.
The corresponding columns of the matrix U in the SVD give you an orthonormal family B_k, k = 1..N. The squared ith component of B_k gives you the probability that the ith sample belongs to cluster k.
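A rough Javascript sketch of the first steps, for anyone who wants to experiment before switching languages. It builds the vectors u and the matrix M as defined above, but it does not compute a full SVD: as a stand-in it uses power iteration, which only approximates the single most significant direction (for a symmetric positive semidefinite matrix such as M this coincides, up to sign, with the first column of U). All function names here are invented for the illustration.

```javascript
// u = (x1, x2 - x1, ..., xn - x(n-1), y1, y2 - y1, ..., yn - y(n-1))
function featureVector(x, y) {
  const steps = (v) => v.map((val, i) => (i === 0 ? val : val - v[i - 1]));
  return steps(x).concat(steps(y));
}

function dot(a, b) {
  let s = 0;
  for (let i = 0; i < a.length; i++) {
    s += a[i] * b[i];
  }
  return s;
}

// M[i][j] = dot product of ui and uj
function gramMatrix(vectors) {
  return vectors.map((ui) => vectors.map((uj) => dot(ui, uj)));
}

// Power iteration, a stand-in for a real SVD routine.
// Limitation: it finds only one direction, and it can fail to converge to
// the dominant one if the all-ones start vector is orthogonal to it.
function dominantVector(M, iterations = 200) {
  let v = M.map(() => 1 / Math.sqrt(M.length));
  for (let k = 0; k < iterations; k++) {
    const w = M.map((row) => dot(row, v));
    const norm = Math.sqrt(dot(w, w));
    if (norm === 0) return v; // M maps v to zero; nothing more to refine
    v = w.map((val) => val / norm);
  }
  return v;
}

const samples = [
  featureVector([1, 2, 3, 4, 5], [12, 14, 16, 8, 6]),
  featureVector([1, 2, 3, 4, 5], [11, 13, 15, 7, 5]),
  featureVector([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),
];
const M = gramMatrix(samples);
const b1 = dominantVector(M);
// Squared components of a unit vector sum to 1, which is what lets them
// be read as membership weights.
const weights = b1.map((c) => c * c);
```

Finding all N significant singular values and their vectors needs a proper SVD or eigenvalue routine, which is the part the answer considers impractical to hand-write in JS.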
If it is OK to use Java, you really should have a look at Weka. It is possible to access all features via Java code. Maybe you'll find a Markov clustering implementation there, but if not, they have a lot of other clustering algorithms, and it's really easy to use.