Comparing 2D data / sets of scatter plots
I have 2000 sets of data, each containing a little over 1000 2D points. I'm looking to cluster these sets into anywhere from 20 to 100 clusters based on similarity. However, I'm having trouble coming up with a reliable method of comparing sets of data. I've tried a few (rather primitive) approaches and done loads of research, but I can't seem to find anything that fits what I need to do.
I've posted an image below of 3 of my data sets plotted. The data is bounded 0-1 on the y axis and falls within the ~0-0.10 range on the x axis (in practice; in theory it could be greater than 0.10).
The shape and relative proportions of the data are probably the most important things to compare. However, the absolute locations of each data set matter as well. In other words, the closer the relative positions of the points in one data set are to those of another, the more similar the two sets are; their absolute positions then need to be accounted for as a secondary factor.
Green and red should be considered very different, but if push comes to shove, they should be more similar than blue and red.
I have tried to:
- compare based on overall averages and deviations,
- split the points into coordinate regions (i.e. (0-0.10, 0-0.10), (0.10-0.20, 0.10-0.20) ... (0.9-1.0, 0.9-1.0)) and compare similarity based on shared points within regions (a sketch of what I mean follows this list),
- measure the average Euclidean distance to nearest neighbours between the data sets.
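To make the region-based attempt concrete, here is a rough sketch of the kind of comparison I mean (the helper name and bin count are just illustrative, and histogram intersection is my stand-in for "shared points within regions"; `set_a` and `set_b` are (N, 2) NumPy arrays):

```python
import numpy as np

def region_similarity(set_a, set_b, bins=10):
    """Bin both point sets into a bins x bins grid of coordinate
    regions and score similarity by histogram intersection."""
    rng = [[0.0, 1.0], [0.0, 1.0]]
    h_a, _, _ = np.histogram2d(set_a[:, 0], set_a[:, 1], bins=bins, range=rng)
    h_b, _, _ = np.histogram2d(set_b[:, 0], set_b[:, 1], bins=bins, range=rng)
    # fraction of points that land in the same regions
    return np.minimum(h_a, h_b).sum() / max(h_a.sum(), h_b.sum())
```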
All of these have produced faulty results. The closest answer I could find in my research was "Appropriate similarity metrics for multiple sets of 2D coordinates". However, the answer given there suggests comparing the average distance of nearest neighbours from the centroid, which I don't think will work for me, as direction is as important as distance for my purposes.
I might add that this will be used to generate input data for another program and will only be run sporadically (mainly to generate different sets of data with different numbers of clusters), so semi-time-consuming algorithms are not out of the question.
In two steps:
1) First: tell the blues apart.
Compute the mean nearest-neighbor distance, up to a cutoff. Select a cutoff something like the black distance in the following image:
The blue configurations, as they are more scattered, will give you values much greater than the reds and greens.
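In Python this step might look like the following sketch (assuming SciPy; capping each distance at the cutoff is one reading of "up to a cutoff"):

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_nn_distance(points, cutoff):
    """Mean nearest-neighbor distance within one (N, 2) point set,
    with each distance capped at the cutoff."""
    tree = cKDTree(points)
    # k=2 because the nearest neighbor of each point is itself
    dists, _ = tree.query(points, k=2)
    return np.minimum(dists[:, 1], cutoff).mean()
```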
2) Second: tell the reds and greens apart.
Disregard all points whose nearest-neighbor distance is greater than some smaller value (for example, one fourth of the previous cutoff). Then cluster the remaining points by proximity, so as to get clusters of the form:
Discard clusters with fewer than 10 points (or so). For each remaining cluster, run a linear fit and calculate the covariance. The mean covariance for red will be much higher than for green, since the greens are very well aligned at this scale.
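A sketch of this step (assuming SciPy; single-linkage clustering stands in for "cluster by proximity", and the residual variance around the fitted line stands in for the covariance):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial import cKDTree

def step2_score(points, radius, min_size=10):
    """Keep tightly packed points, cluster them by proximity,
    and average the spread of a per-cluster linear fit."""
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=2)
    dense = points[dists[:, 1] < radius]      # drop isolated points

    # single-linkage: merge chains of points closer than `radius`
    labels = fcluster(linkage(dense, method='single'),
                      t=radius, criterion='distance')

    spreads = []
    for lab in np.unique(labels):
        cluster = dense[labels == lab]
        if len(cluster) < min_size:           # discard small clusters
            continue
        a, b = np.polyfit(cluster[:, 0], cluster[:, 1], 1)
        residuals = cluster[:, 1] - (a * cluster[:, 0] + b)
        spreads.append(residuals.var())       # spread around the fitted line
    return np.mean(spreads) if spreads else 0.0
```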
There you are.
HTH!
Although belisarius has answered this well, here are a couple of comments:
If you could reduce each set of ~1000 points to, say, 32 clusters of 32 points each (or 20 x 50, or ...), then you could work in 32-space instead of 1000-space. Try K-means clustering for this; see also SO questions/tagged/k-means.
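A sketch of that reduction (using scikit-learn's KMeans, one of the implementations discussed under that tag):

```python
import numpy as np
from sklearn.cluster import KMeans

def reduce_set(points, k=32):
    """Replace one ~1000-point 2D set by its k-means cluster centres."""
    km = KMeans(n_clusters=k, n_init=10).fit(points)
    return km.cluster_centers_                # shape (k, 2)

# reduced = [reduce_set(s) for s in all_sets]  # 2000 sets -> 2000 arrays of (32, 2)
```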
One way to measure the distance between two sets A and B (of points, or of clusters) is to take nearest pairs, like this:
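For instance, a sketch of one such nearest-pairs measure, the symmetric average nearest-neighbour distance (one plausible reading; the exact pairing rule here is an assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

def nearest_pair_distance(A, B):
    """Symmetric average nearest-pair distance between point sets A and B."""
    d_ab, _ = cKDTree(B).query(A, k=1)   # each a in A to its nearest b in B
    d_ba, _ = cKDTree(A).query(B, k=1)   # each b in B to its nearest a in A
    return 0.5 * (d_ab.mean() + d_ba.mean())
```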
You could then cluster your 2000 points in 32-space to 20 cluster centres in one shot. (The usual Euclidean distance wouldn't work here at all.)
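A sketch of that one-shot clustering under such a custom set-to-set distance (hierarchical clustering on a precomputed distance matrix; `nearest_pair_distance` is the helper sketched above, and average linkage is an arbitrary choice):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_sets(reduced_sets, n_clusters=20):
    """Group the reduced sets into n_clusters using pairwise
    nearest-pair distances (see the sketch above)."""
    n = len(reduced_sets)
    dmat = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = nearest_pair_distance(reduced_sets[i], reduced_sets[j])
            dmat[i, j] = dmat[j, i] = d
    # condense the symmetric matrix and cluster hierarchically
    Z = linkage(squareform(dmat), method='average')
    return fcluster(Z, t=n_clusters, criterion='maxclust')
```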
Please follow up and tell us what worked in the end, and what didn't.