How to weight features for better clustering with a very small data set?
I'm working on a program that takes in several (<50) high-dimensional points in feature space (1000+ dimensions) and performs hierarchical clustering on them by recursively applying standard k-clustering.

My problem is that in any one k-clustering pass, different parts of the high-dimensional representation are redundant. I know this problem falls under the umbrella of feature extraction, selection, or weighting.

In general, what does one take into account when selecting a particular feature extraction/selection/weighting algorithm? And specifically, what algorithm would be the best way to prepare my data for clustering in my situation?
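For concreteness, here is a minimal sketch of the setup described above, assuming scikit-learn's KMeans; the binary splits (k=2) and the minimum-leaf-size stopping rule are illustrative assumptions, not part of the question.

```python
import numpy as np
from sklearn.cluster import KMeans

def recursive_kmeans(points, indices=None, k=2, min_size=3):
    """Recursively split `points` with k-means; returns a nested list of indices."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) < k * min_size:
        return indices.tolist()              # leaf: too small to split further
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(points[indices])
    return [recursive_kmeans(points, indices[labels == c], k, min_size)
            for c in range(k)]

X = np.random.rand(50, 1000)                 # <50 points, 1000+ dimensions
tree = recursive_kmeans(X)
```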
Check out this paper:
Witten, D. M., and Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490), 713–726.

And the related COSA paper by Friedman. They both discuss these issues in depth.
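For a sense of what that framework looks like in practice, below is a simplified sketch of the sparse k-means idea from the Witten & Tibshirani paper: alternate between k-means on weight-scaled features and re-weighting each feature by its soft-thresholded between-cluster sum of squares. The fixed threshold fraction `delta` is a simplification of mine (the paper tunes the sparsity level by binary search over an L1 bound), so treat this as an illustration rather than a faithful implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def sparse_kmeans(X, n_clusters, delta=0.1, n_iter=10):
    """Simplified sparse k-means: alternate clustering and feature re-weighting."""
    n, p = X.shape
    w = np.full(p, 1.0 / np.sqrt(p))         # uniform initial weights, ||w||_2 = 1
    labels = None
    for _ in range(n_iter):
        # (a) cluster on features scaled by the square root of their weights
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X * np.sqrt(w))
        # (b) per-feature between-cluster SS = total SS - within-cluster SS
        total_ss = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
        within_ss = np.zeros(p)
        for c in range(n_clusters):
            Xc = X[labels == c]
            within_ss += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
        bcss = total_ss - within_ss
        # Soft-threshold to zero out weakly informative features, then renormalize.
        w = np.maximum(bcss - delta * bcss.max(), 0.0)
        w /= np.linalg.norm(w) + 1e-12
    return labels, w
```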
I would suggest a combination of PCA-based feature selection and k-means.

Find your principal components and order them by weight, then consume those weights at each depth of your hierarchy.

For example, assume you have a cluster hierarchy of four depths and you obtain component weights sorted in decreasing order, with the top component, PC1, carrying a weight of 0.32.
We want to consume a weight of 1/N from the top for each depth, where N is the depth count; taking N as 4 here, each depth gets a budget of 0.25. On the first iteration, 0.25 of the first component gets consumed, so its new score becomes 0.32 - 0.25 = 0.07. On the second iteration, we again consume the top 0.25, which takes the remaining 0.07 of PC1 plus 0.18 of PC2. The third iteration proceeds the same way, and the fourth iteration uses up the rest, where the remaining weights sum to 0.25.

At each iteration we use only the features whose weight we consume. For example, on the second iteration we use only PC1 and PC2 of the features after the KLT, since those are the only components whose weights we consume. Thus, the set of components to cluster on changes from one iteration to the next.
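The weights table from the original example did not survive, so in the sketch below only PC1's 0.32 is taken from the text; the remaining weights are hypothetical placeholders chosen to sum to 1.0. The loop walks the sorted weights and grants each depth a 0.25 budget, reproducing the bookkeeping described above.

```python
import numpy as np

# Hypothetical component weights, sorted in decreasing order. Only the 0.32
# for PC1 comes from the text; the rest are placeholders. In practice these
# could be, e.g., PCA(...).fit(X).explained_variance_ratio_.
weights = np.array([0.32, 0.24, 0.16, 0.11, 0.09, 0.05, 0.03])
N = 4                                  # depth of the hierarchy
budget = 1.0 / N                       # weight consumed per depth: 0.25
eps = 1e-9                             # tolerance for floating-point leftovers

remaining = weights.copy()
for depth in range(N):
    consumed, used = 0.0, []
    for j in range(len(remaining)):
        if remaining[j] < eps or budget - consumed < eps:
            continue
        take = min(remaining[j], budget - consumed)
        remaining[j] -= take
        consumed += take
        used.append(f"PC{j + 1}")
    print(f"depth {depth + 1}: cluster on {used}")
# depth 1: cluster on ['PC1']              (0.25 of PC1, leaving 0.07)
# depth 2: cluster on ['PC1', 'PC2']       (0.07 of PC1 + 0.18 of PC2)
# depth 3: cluster on ['PC2', 'PC3', 'PC4']
# depth 4: cluster on ['PC4', 'PC5', 'PC6', 'PC7']
```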
You may target a final weight consumption that is less than 1.0 and iterate over a smaller total weight for this purpose. This is effectively the same as filtering out all components beyond your target weight for dimensionality reduction prior to clustering.

Finally, I don't know whether this approach has a name; it just feels natural to use PCA for unsupervised problems. You may also try supervised feature selection after the first iteration, since you then have cluster labels at hand.
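As a sketch of that closing suggestion, the labels from the first pass can drive ordinary supervised feature selection, assuming scikit-learn; the ANOVA F-test scorer and the k=100 feature budget here are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.rand(50, 1000)                   # toy stand-in for the real data
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # first clustering pass
# Treat the cluster labels as pseudo-targets and keep the 100 features that
# separate the clusters best under a one-way ANOVA F-test.
X_reduced = SelectKBest(f_classif, k=100).fit_transform(X, labels)
# Deeper levels of the hierarchy can then cluster on X_reduced instead of X.
```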