I have 2,000,000 points in a 100-dimensional space. How can I cluster them into K (e.g. 1000) clusters?
The problem is as follows. I have M images and extract N features from each image, where each feature has dimensionality L. Thus I have M*N features (2,000,000 in my case), each of dimensionality L (100 in my case). I need to cluster these M*N features into K clusters. How can I do it? Thanks.
Comments (4)
Do you want 1000 clusters of images, of features, or of (image, feature) pairs?
In any case, it sounds as though you'll have to reduce the data and use simpler methods.
One possibility is a two-pass K-means:
a) split the 2 million data points into 32 clusters,
b) split each of these into 32 more.
If this works, the resulting 32^2 = 1024 clusters might be good enough for your purpose (a minimal sketch follows).
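A sketch of that two-pass scheme, assuming scikit-learn's MiniBatchKMeans for speed; the random stand-in data is illustrative, and each coarse cluster is assumed to contain at least 32 points:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Stand-in for the real data: 2,000,000 points in 100 dimensions.
X = np.random.rand(2_000_000, 100).astype(np.float32)

# Pass 1: 32 coarse clusters over all points.
coarse = MiniBatchKMeans(n_clusters=32, batch_size=10_000, n_init=3)
coarse_labels = coarse.fit_predict(X)

# Pass 2: split each coarse cluster into 32 sub-clusters,
# giving 32 * 32 = 1024 clusters overall.
labels = np.empty(len(X), dtype=np.int32)
for i in range(32):
    mask = coarse_labels == i
    sub = MiniBatchKMeans(n_clusters=32, batch_size=10_000, n_init=3)
    labels[mask] = i * 32 + sub.fit_predict(X[mask])
```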
Then, do you really need all 100 coordinates? Could you guess the 20 most important ones, or just try random subsets of 20? There's a huge literature: Google +image "dimension reduction" gives ~70,000 hits.
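As a rough illustration of the random-subset idea (stand-in data, arbitrary number of trials), one could cluster on a few random 20-dimensional slices of the 100 coordinates and compare inertia as a crude quality score:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(2_000_000, 100).astype(np.float32)  # stand-in features
rng = np.random.default_rng(0)

# Try a few random 20-dimensional subsets of the 100 coordinates.
for trial in range(3):
    dims = rng.choice(100, size=20, replace=False)
    km = MiniBatchKMeans(n_clusters=1000, batch_size=10_000, n_init=3)
    km.fit(X[:, dims])
    print(f"trial {trial}: inertia = {km.inertia_:.1f}")
```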
You've tagged the question "k-means". Why can't you use k-means? Is it a question of efficiency? (Personally, I've only used k-means in 2 dimensions.) Or is it a question of how to encode the k-means algorithm?
Are your values discrete (e.g. categories) or continuous (e.g. coordinate values)? If the latter, then k-means should be fine, in my understanding. Clustering discrete values requires a different algorithm - perhaps hierarchical clustering?
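For continuous values, a plain k-means call is all that's needed in principle; a toy sketch with scikit-learn (sizes scaled down here, since at 2,000,000 x 100 runtime becomes the real issue):

```python
import numpy as np
from sklearn.cluster import KMeans

# Continuous-valued vectors: k-means applies directly.
X = np.random.rand(10_000, 100)                      # stand-in feature vectors
km = KMeans(n_clusters=50, n_init=10).fit(X)
print(km.labels_.shape, km.cluster_centers_.shape)   # (10000,) (50, 100)
```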
The EM-tree and K-tree algorithms in the LMW-tree project can cluster problems this big and larger. Our most recent result is clustering 733 million web pages into 600,000 clusters. There is also a streaming variant of the EM-tree where the dataset is streamed from disk for each iteration.
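LMW-tree itself is a C++ project, so the following is not its API; purely as a generic Python illustration of the streaming idea, scikit-learn's MiniBatchKMeans can be updated from chunks read off disk ("features.npy" is a hypothetical file):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

km = MiniBatchKMeans(n_clusters=1000, n_init=3)

# "features.npy" is a hypothetical on-disk array of shape (2_000_000, 100);
# mmap_mode="r" avoids loading it into memory all at once.
X = np.load("features.npy", mmap_mode="r")
for start in range(0, X.shape[0], 100_000):
    chunk = np.asarray(X[start:start + 100_000], dtype=np.float32)
    km.partial_fit(chunk)   # update centroids from this chunk only
```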
A good trick when clustering millions of points is to sample them, cluster the sample, and then assign the remaining points to the clusters found on the sample.
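A minimal sketch of that trick, assuming scikit-learn (the sample size of 100,000 is arbitrary): fit k-means on a random sample, then assign every point to its nearest learned centroid with predict:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(2_000_000, 100).astype(np.float32)  # stand-in data
rng = np.random.default_rng(0)

# Cluster a manageable random sample...
idx = rng.choice(len(X), size=100_000, replace=False)
km = KMeans(n_clusters=1000, n_init=3).fit(X[idx])

# ...then assign every point (sampled or not) to its nearest centroid.
labels = km.predict(X)
```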