How to weight features for better clustering with a very small data set?
I'm working on a program that takes in several (<50) high-dimensional points in feature space (1000+ dimensions) and performs hierarchical clustering on them by recursively applying standard k-clustering.

My problem is that in any one k-clustering pass, different parts of the high-dimensional representation are redundant. I know this problem falls under the umbrella of feature extraction, selection, or weighting.

In general, what does one take into account when selecting a particular feature extraction/selection/weighting algorithm? And specifically, what algorithm would be the best way to prepare my data for clustering in my situation?
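For concreteness, here is a minimal sketch of the setup described above, assuming scikit-learn's KMeans; the binary splits (k=2) and the minimum-leaf-size stopping rule are illustrative assumptions, not part of the question.

```python
import numpy as np
from sklearn.cluster import KMeans

def recursive_kmeans(points, indices=None, k=2, min_size=3):
    """Recursively split `points` with k-means; returns a nested list of indices."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) < k * min_size:
        return indices.tolist()              # leaf: too small to split further
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(points[indices])
    return [recursive_kmeans(points, indices[labels == c], k, min_size)
            for c in range(k)]

X = np.random.rand(50, 1000)                 # <50 points, 1000+ dimensions
tree = recursive_kmeans(X)
```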
Check out this paper:
Witten, D. M., and Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490), 713–726.

And the related COSA paper by Friedman. They both discuss these issues in depth.
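For a sense of what that framework looks like in practice, below is a simplified sketch of the sparse k-means idea from the Witten & Tibshirani paper: alternate between k-means on weight-scaled features and re-weighting each feature by its soft-thresholded between-cluster sum of squares. The fixed threshold fraction `delta` is a simplification of mine (the paper tunes the sparsity level by binary search over an L1 bound), so treat this as an illustration rather than a faithful implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def sparse_kmeans(X, n_clusters, delta=0.1, n_iter=10):
    """Simplified sparse k-means: alternate clustering and feature re-weighting."""
    n, p = X.shape
    w = np.full(p, 1.0 / np.sqrt(p))         # uniform initial weights, ||w||_2 = 1
    labels = None
    for _ in range(n_iter):
        # (a) cluster on features scaled by the square root of their weights
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X * np.sqrt(w))
        # (b) per-feature between-cluster SS = total SS - within-cluster SS
        total_ss = ((X - X.mean(axis=0)) ** 2).sum(axis=0)
        within_ss = np.zeros(p)
        for c in range(n_clusters):
            Xc = X[labels == c]
            within_ss += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
        bcss = total_ss - within_ss
        # Soft-threshold to zero out weakly informative features, then renormalize.
        w = np.maximum(bcss - delta * bcss.max(), 0.0)
        w /= np.linalg.norm(w) + 1e-12
    return labels, w
```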
I would suggest a combination of PCA-based feature selection and k-means.

Find your principal components and order them by weight, then consume those weights at each depth of your hierarchy.

For example, assume you have a cluster hierarchy of four depths and you obtain component weights sorted in decreasing order, with the top component, PC1, carrying a weight of 0.32.
We want to consume a weight of 1/N from the top for each depth, where N is the depth count; taking N as 4 here, each depth gets a budget of 0.25. On the first iteration, 0.25 of the first component gets consumed, so its new score becomes 0.32 - 0.25 = 0.07. On the second iteration, we again consume the top 0.25, which takes the remaining 0.07 of PC1 plus 0.18 of PC2. The third iteration proceeds the same way, and the fourth iteration uses up the rest, where the remaining weights sum to 0.25.

At each iteration we use only the features whose weight we consume. For example, on the second iteration we use only PC1 and PC2 of the features after the KLT, since those are the only components whose weights we consume. Thus, the set of components to cluster on changes from one iteration to the next.
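The weights table from the original example did not survive, so in the sketch below only PC1's 0.32 is taken from the text; the remaining weights are hypothetical placeholders chosen to sum to 1.0. The loop walks the sorted weights and grants each depth a 0.25 budget, reproducing the bookkeeping described above.

```python
import numpy as np

# Hypothetical component weights, sorted in decreasing order. Only the 0.32
# for PC1 comes from the text; the rest are placeholders. In practice these
# could be, e.g., PCA(...).fit(X).explained_variance_ratio_.
weights = np.array([0.32, 0.24, 0.16, 0.11, 0.09, 0.05, 0.03])
N = 4                                  # depth of the hierarchy
budget = 1.0 / N                       # weight consumed per depth: 0.25
eps = 1e-9                             # tolerance for floating-point leftovers

remaining = weights.copy()
for depth in range(N):
    consumed, used = 0.0, []
    for j in range(len(remaining)):
        if remaining[j] < eps or budget - consumed < eps:
            continue
        take = min(remaining[j], budget - consumed)
        remaining[j] -= take
        consumed += take
        used.append(f"PC{j + 1}")
    print(f"depth {depth + 1}: cluster on {used}")
# depth 1: cluster on ['PC1']              (0.25 of PC1, leaving 0.07)
# depth 2: cluster on ['PC1', 'PC2']       (0.07 of PC1 + 0.18 of PC2)
# depth 3: cluster on ['PC2', 'PC3', 'PC4']
# depth 4: cluster on ['PC4', 'PC5', 'PC6', 'PC7']
```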
You may target a final weight consumption that is less than 1.0 and iterate over a smaller total weight for this purpose. This is effectively the same as filtering out all components beyond your target weight for dimensionality reduction prior to clustering.

Finally, I don't know whether this approach has a name; it just feels natural to use PCA for unsupervised problems. You may also try supervised feature selection after the first iteration, since you then have cluster labels at hand.
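As a sketch of that closing suggestion, the labels from the first pass can drive ordinary supervised feature selection, assuming scikit-learn; the ANOVA F-test scorer and the k=100 feature budget here are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_selection import SelectKBest, f_classif

X = np.random.rand(50, 1000)                   # toy stand-in for the real data
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # first clustering pass
# Treat the cluster labels as pseudo-targets and keep the 100 features that
# separate the clusters best under a one-way ANOVA F-test.
X_reduced = SelectKBest(f_classif, k=100).fit_transform(X, labels)
# Deeper levels of the hierarchy can then cluster on X_reduced instead of X.
```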