Comparing 2D data / sets of scatter plots
I have 2000 sets of data, each containing a little over 1000 2D points. I'm looking to cluster these sets into anywhere from 20 to 100 clusters based on similarity. However, I'm having trouble coming up with a reliable method of comparing sets of data. I've tried a few (rather primitive) approaches and done loads of research, but I can't seem to find anything that fits what I need to do.
I've posted an image below of 3 of my data sets plotted. The data is bounded 0-1 on the y axis and falls within the ~0-0.10 range on the x axis (in practice; in theory it could be greater than 0.10).
The shape and relative proportions of the data are probably the most important things to compare. However, the absolute locations of each data set matter as well. In other words, the closer the relative positions of the points in one data set are to those of another, the more similar the two sets are; their absolute positions then need to be accounted for as a secondary factor.
Green and red should be considered very different, but if push comes to shove, they should be more similar than blue and red.
I have tried to:
- compare based on overall averages and deviations,
- split the points into coordinate regions (i.e. (0-0.10, 0-0.10), (0.10-0.20, 0.10-0.20) ... (0.9-1.0, 0.9-1.0)) and compare similarity based on shared points within regions (a sketch of what I mean follows this list),
- measure the average Euclidean distance to nearest neighbours between the data sets.
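To make the region-based attempt concrete, here is a rough sketch of the kind of comparison I mean (the helper name and bin count are just illustrative, and histogram intersection is my stand-in for "shared points within regions"; `set_a` and `set_b` are (N, 2) NumPy arrays):

```python
import numpy as np

def region_similarity(set_a, set_b, bins=10):
    """Bin both point sets into a bins x bins grid of coordinate
    regions and score similarity by histogram intersection."""
    rng = [[0.0, 1.0], [0.0, 1.0]]
    h_a, _, _ = np.histogram2d(set_a[:, 0], set_a[:, 1], bins=bins, range=rng)
    h_b, _, _ = np.histogram2d(set_b[:, 0], set_b[:, 1], bins=bins, range=rng)
    # fraction of points that land in the same regions
    return np.minimum(h_a, h_b).sum() / max(h_a.sum(), h_b.sum())
```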
All of these have produced faulty results. The closest answer I could find in my research was "Appropriate similarity metrics for multiple sets of 2D coordinates". However, the answer given there suggests comparing the average distance of nearest neighbours from the centroid, which I don't think will work for me, as direction is as important as distance for my purposes.
I might add that this will be used to generate input data for another program and will only be run sporadically (mainly to generate different sets of data with different numbers of clusters), so semi-time-consuming algorithms are not out of the question.
In two steps:
1) First: tell the blues apart.
Compute the mean nearest-neighbor distance, up to a cutoff. Select a cutoff something like the black distance in the following image:
The blue configurations, as they are more scattered, will give you values much greater than the reds and greens.
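In Python this step might look like the following sketch (assuming SciPy; capping each distance at the cutoff is one reading of "up to a cutoff"):

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_nn_distance(points, cutoff):
    """Mean nearest-neighbor distance within one (N, 2) point set,
    with each distance capped at the cutoff."""
    tree = cKDTree(points)
    # k=2 because the nearest neighbor of each point is itself
    dists, _ = tree.query(points, k=2)
    return np.minimum(dists[:, 1], cutoff).mean()
```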
2) Second: tell the reds and greens apart.
Disregard all points whose nearest-neighbor distance is greater than some smaller value (for example, one fourth of the previous cutoff). Then cluster the remaining points by proximity, so as to get clusters of the form:
Discard clusters with fewer than 10 points (or so). For each remaining cluster, run a linear fit and calculate the covariance. The mean covariance for red will be much higher than for green, since the greens are very well aligned at this scale.
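A sketch of this step (assuming SciPy; single-linkage clustering stands in for "cluster by proximity", and the residual variance around the fitted line stands in for the covariance):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial import cKDTree

def step2_score(points, radius, min_size=10):
    """Keep tightly packed points, cluster them by proximity,
    and average the spread of a per-cluster linear fit."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=2)
    dense = points[dists[:, 1] < radius]      # drop isolated points

    # single-linkage: merge chains of points closer than `radius`
    labels = fcluster(linkage(dense, method='single'),
                      t=radius, criterion='distance')

    spreads = []
    for lab in np.unique(labels):
        cluster = dense[labels == lab]
        if len(cluster) < min_size:           # discard small clusters
            continue
        a, b = np.polyfit(cluster[:, 0], cluster[:, 1], 1)
        residuals = cluster[:, 1] - (a * cluster[:, 0] + b)
        spreads.append(residuals.var())       # spread around the fitted line
    return np.mean(spreads) if spreads else 0.0
```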
There you are.
HTH!
Although belisarius has answered this well, here are a couple of comments:
If you could reduce each set of ~1000 points to, say, 32 clusters of 32 points each (or 20 x 50, or ...), then you could work in 32-space instead of 1000-space. Try K-means clustering for this; see also SO questions/tagged/k-means.
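A sketch of that reduction (using scikit-learn's KMeans, one of the implementations discussed under that tag):

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_set(points, k=32):
    """Replace one ~1000-point 2D set by its k-means cluster centres."""
    km = KMeans(n_clusters=k, n_init=10).fit(points)
    return km.cluster_centers_                # shape (k, 2)

# reduced = [reduce_set(s) for s in all_sets]  # 2000 sets -> 2000 arrays of (32, 2)
```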
One way to measure the distance between two sets A and B (of points, or of clusters) is to take nearest pairs, like this:
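For instance, a sketch of one such nearest-pairs measure, the symmetric average nearest-neighbour distance (one plausible reading; the exact pairing rule here is an assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_pair_distance(A, B):
    """Symmetric average nearest-pair distance between point sets A and B."""
    d_ab, _ = cKDTree(B).query(A, k=1)   # each a in A to its nearest b in B
    d_ba, _ = cKDTree(A).query(B, k=1)   # each b in B to its nearest a in A
    return 0.5 * (d_ab.mean() + d_ba.mean())
```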
You could then cluster your 2000 points in 32-space to 20 cluster centres in one shot. (The usual Euclidean distance wouldn't work here at all.)
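A sketch of that one-shot clustering under such a custom set-to-set distance (hierarchical clustering on a precomputed distance matrix; `nearest_pair_distance` is the helper sketched above, and average linkage is an arbitrary choice):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_sets(reduced_sets, n_clusters=20):
    """Group the reduced sets into n_clusters using pairwise
    nearest-pair distances (see the sketch above)."""
    n = len(reduced_sets)
    dmat = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = nearest_pair_distance(reduced_sets[i], reduced_sets[j])
            dmat[i, j] = dmat[j, i] = d
    # condense the symmetric matrix and cluster hierarchically
    Z = linkage(squareform(dmat), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```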
Please follow up and tell us what worked in the end, and what didn't.