How do I choose the T1 and T2 thresholds for Canopy clustering?
I am trying to implement the Canopy clustering algorithm along with K-means. From what I have read online, Canopy clustering is used to get the initial starting points to feed into K-means. The problem is that Canopy clustering requires two threshold values for the canopy, T1 and T2: points within the inner threshold are strongly tied to that canopy, while points within the wider threshold are less strongly tied to it. How are these thresholds, i.e. the distances from the canopy center, determined?
Problem context:
The problem I'm trying to solve is this: I have a set of numbers in a range such as [1,30] or [1,250], with set sizes of about 50. There can be duplicate elements, and they can be floating-point numbers as well, such as 8, 17.5, 17.5, 23, 66, ... I want to find the optimal clusters, or subsets, of the set of numbers.
So, if Canopy clustering with K-means is a good choice, my question still stands: how do you find the T1 and T2 values? If it is not a good choice, is there a better, simpler but still effective algorithm to use?
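For reference, here is a minimal sketch of where the two thresholds enter the canopy step, written for 1-D numbers like the ones above. It assumes the common convention that T1 is the loose (wider) threshold and T2 the tight (inner) one, so T1 > T2; the function name `canopy_1d` and the threshold values in the usage note are made up for illustration.

```python
def canopy_1d(values, t1, t2):
    """Group 1-D points into (possibly overlapping) canopies.

    Assumption: t1 is the loose threshold and t2 the tight one (t1 > t2).
    Returns a list of (center, members) tuples.
    """
    if t1 <= t2:
        raise ValueError("expected t1 (loose) to be greater than t2 (tight)")

    remaining = list(values)
    canopies = []
    while remaining:
        # Any remaining point can start a canopy; the first one is used
        # here so that the result is deterministic.
        center = remaining.pop(0)
        # Every remaining point within the loose threshold joins the canopy.
        members = [center] + [v for v in remaining if abs(v - center) < t1]
        canopies.append((center, members))
        # Points within the tight threshold are removed, so they cannot
        # become the center of a later canopy.
        remaining = [v for v in remaining if abs(v - center) >= t2]
    return canopies
```

Calling `canopy_1d([8, 17.5, 17.5, 23, 66], t1=20, t2=10)` yields one canopy per surviving center, and those centers are what would be handed to K-means as starting points. Points that fall between T2 and T1 can end up in more than one canopy.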
Comments (2)
Perhaps naively, I see the problem in terms of a sort of spectral estimation. Suppose I have 10 vectors. I can compute the distances between all pairs, which in this case gives 45 distances. Plot them as a histogram over various distance ranges, e.g. 10 distances fall between 0.1 and 0.2, 5 between 0.2 and 0.3, and so on, and you get an idea of how the distances between vectors are distributed. From this information you can choose T1 and T2 (e.g. choose them so that they cover the most populated distance range).
Of course, this is not practical for a large dataset, but you could take a random sample so that you at least know the ballpark for T1 and T2. Using something like Hadoop, you could do some sort of prior spectral estimation on a large number of points. If all the incoming data you are trying to cluster is distributed in much the same way, then you just need to get T1 and T2 once and fix them as constants for all future runs.
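A rough sketch of the histogram idea for 1-D data follows. It uses only the standard library; the function names and the equal-width binning are assumptions chosen for illustration, and reading the edges of the most populated bin as candidate thresholds is just one literal interpretation of "cover the most populated range".

```python
from itertools import combinations


def pairwise_distances(values):
    """Return |a - b| for every unordered pair of 1-D points."""
    return [abs(a - b) for a, b in combinations(values, 2)]


def distance_histogram(distances, num_bins=10):
    """Bucket distances into equal-width bins; return (edges, counts)."""
    lo, hi = min(distances), max(distances)
    # Fall back to a width of 1.0 when all distances are identical.
    width = (hi - lo) / num_bins or 1.0
    counts = [0] * num_bins
    for d in distances:
        # Clamp so that the maximum distance lands in the last bin.
        idx = min(int((d - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


def most_populated_range(values, num_bins=10):
    """Return (low, high) edges of the bin holding the most distances."""
    edges, counts = distance_histogram(pairwise_distances(values), num_bins)
    peak = counts.index(max(counts))
    return edges[peak], edges[peak + 1]


if __name__ == "__main__":
    data = [8, 17.5, 17.5, 23, 66]
    low, high = most_populated_range(data)
    print("candidate tight/loose thresholds:", low, high)
```

With duplicate values in the data, the lowest bin starts at a distance of 0, so the lower edge may not be usable as a threshold on its own. Inspecting the full `counts` list is likely more informative than relying on a single bin.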
Actually, that is the big issue with Canopy clustering: choosing the thresholds is pretty much as difficult as the actual algorithm, particularly in high dimensions. For a 2D geographic data set, a domain expert can probably define the distance thresholds easily. But for high-dimensional data, probably the best you can do is to run k-means on a sample of your data first, then choose the distances based on that sample run.
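One way this could look for 1-D data is sketched below, with a plain Lloyd-style k-means on the sample. How to turn the sample run into distances is not specified above, so the rule used here (tight threshold = mean distance of a point to its centroid, loose threshold = largest such distance) is purely an assumption for illustration, as are the function names and the choice of `k`.

```python
import random


def assign(values, centroids):
    """Assign each value to the index of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters


def kmeans_1d(values, k, iterations=50, seed=0):
    """Plain Lloyd-style k-means on 1-D data; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Start from k distinct values so that no two centroids coincide.
    centroids = rng.sample(sorted(set(values)), k)
    for _ in range(iterations):
        clusters = assign(values, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)


def thresholds_from_sample(sample, k):
    """Derive (t1, t2) from a k-means run on a sample (assumed heuristic)."""
    centroids, clusters = kmeans_1d(sample, k)
    radii = [abs(v - c) for c, members in zip(centroids, clusters) for v in members]
    t2 = sum(radii) / len(radii)  # tight: typical distance to own centroid
    t1 = max(radii)               # loose: largest distance to own centroid
    return t1, t2


if __name__ == "__main__":
    sample = [8, 17.5, 17.5, 23, 66]
    print(thresholds_from_sample(sample, k=2))
```

For a larger data set, `sample` would be a random subset, e.g. drawn with `random.sample`. If `k` equals the number of distinct values, every radius is 0 and the heuristic gives no usable thresholds.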