Should we use k-means++ instead of k-means?
The k-means++ algorithm addresses the following two weaknesses of the original k-means algorithm:
- The original k-means algorithm has a worst-case running time that is super-polynomial in the input size, while k-means++ is claimed to be O(log k).
- The approximation found can be unsatisfactory with respect to the objective function when compared to the optimal clustering.
But are there any drawbacks of k-means++? Should we always use it instead of k-means from now on?
Nobody claims that k-means++ runs in O(lg k) time; its solution quality is O(lg k)-competitive with the optimal solution. Both k-means++ and the common method, called Lloyd's algorithm, are approximations to an NP-hard optimization problem.
I'm not sure what the worst-case running time of k-means++ is; note that in Arthur & Vassilvitskii's original description, steps 2-4 of the algorithm refer to Lloyd's algorithm. They do claim that it works both better and faster in practice because it starts from a better position.
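For reference, the seeding step that distinguishes k-means++ from plain random initialization is short enough to sketch. Here is a minimal NumPy version of the D^2 sampling from the Arthur & Vassilvitskii paper; the function name `kmeanspp_seed` is mine, not from the paper, and this omits the Lloyd iterations that follow in steps 2-4:

```python
import numpy as np

def kmeanspp_seed(X, k, rng=None):
    """D^2 seeding: each new centre is drawn with probability
    proportional to its squared distance from the nearest
    centre already chosen."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    centres = [X[rng.integers(n)]]      # first centre: uniform at random
    for _ in range(k - 1):
        C = np.asarray(centres)         # (m, d) centres chosen so far
        # squared distance of every point to its nearest chosen centre
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).min(axis=1)
        centres.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.asarray(centres)
```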
The drawbacks of k-means++ are thus:
That said, if your k-means library supports k-means++, then by all means try it out.
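In scikit-learn, for instance, trying it out is a one-line switch (k-means++ is in fact that library's default initializer):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

# init="k-means++" is scikit-learn's default; shown explicitly here
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.inertia_)   # within-cluster sum of squares, the k-means objective
```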
Not your question, but an easy speedup for any k-means method on large N:
1) first run k-means on a random sample of, say, sqrt(N) of the points,
2) then run full k-means from those centres.
I've found this 5-10 times faster than k-means++ for N = 10000, k = 20, with similar results.
How well it works for you will depend on how well a sqrt(N) sample approximates the whole, as well as on N, dim, k, ninit, delta ...
What are your N (number of data points), dim (number of features), and k?
The huge range in users' N, dim, k, data noise, and metrics ... not to mention the lack of public benchmarks, makes it tough to compare methods.
Added: Python code for kmeans() and kmeanssample() is here on SO; comments are welcome.
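For concreteness, here is a minimal sketch of that two-stage recipe; the function name `kmeans_sample` and the use of scikit-learn are my assumptions, not the kmeans()/kmeanssample() code linked above:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_sample(X, k, seed=0):
    """Stage 1: k-means on a random sqrt(N) subsample.
    Stage 2: full k-means seeded with the stage-1 centres."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    m = max(k, int(np.sqrt(n)))                  # subsample size, >= k
    sample = X[rng.choice(n, size=m, replace=False)]
    coarse = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(sample)
    # seeding with an explicit centre array => a single run (n_init=1)
    return KMeans(n_clusters=k, init=coarse.cluster_centers_, n_init=1).fit(X)
```

Stage 1 is cheap because it only touches sqrt(N) points; stage 2 then typically needs few Lloyd iterations since it starts near a good solution.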