How do I add feature importance coefficients before k-means?

Published 2025-01-09 18:32:59


Let's say I have the given dataframe:

   feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  feature_7  feature_8
0   0.862874   0.392938   0.669744   0.939903   0.382574   0.780595   0.049201   0.627703
1   0.942322   0.676181   0.223476   0.102698   0.620883   0.834038   0.966355   0.554645
2   0.940375   0.310532   0.975096   0.600778   0.893220   0.282508   0.837575   0.112575
3   0.868902   0.818175   0.102860   0.936395   0.406088   0.619990   0.913905   0.597607
4   0.143344   0.207751   0.835707   0.414900   0.360534   0.525631   0.228751   0.294437
5   0.339856   0.501197   0.671033   0.302202   0.406512   0.997044   0.080621   0.068071
6   0.521056   0.343654   0.812553   0.393159   0.217987   0.247602   0.671783   0.254299
7   0.594744   0.180041   0.884603   0.578050   0.441461   0.176732   0.569595   0.391923
8   0.402864   0.062175   0.565858   0.349415   0.106725   0.323310   0.153594   0.277930
9   0.480539   0.540283   0.248376   0.252237   0.229181   0.092273   0.546501   0.201396

And I would like to find clusters in these rows. To do so, I want to use k-means. However, I would like to find the clusters by giving more importance to [feature_1, feature_2] than to the other features in the dataframe.
Let's say an importance coefficient of 0.5 for [feature_1, feature_2], and 0.5 for the remaining features.

I thought about transforming [feature_3, ..., feature_8] into a single column using PCA. By doing so, I imagine that k-means would give less importance to a single feature than to 6 separate features.
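The PCA idea above could be sketched roughly as follows, assuming scikit-learn is available; the synthetic dataframe, the choice of `n_clusters=2`, and all variable names are illustrative, not part of the original question:

```python
# Sketch of the idea described above: compress feature_3..feature_8 into
# a single principal component so those six features enter k-means as
# one column, while feature_1 and feature_2 stay as separate columns.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for the dataframe shown in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 8)),
                  columns=[f"feature_{i}" for i in range(1, 9)])

rest = [f"feature_{i}" for i in range(3, 9)]
# Reduce the six less-important features to one column.
compressed = PCA(n_components=1).fit_transform(df[rest])

# Cluster on [feature_1, feature_2, pca_component].
X = np.column_stack([df[["feature_1", "feature_2"]].to_numpy(), compressed])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Note that this does not give a precise 0.5/0.5 split: the PCA component's variance depends on the data, so its influence on the Euclidean distance is data-dependent rather than fixed.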

Is this a good idea? Do you see better ways of giving this information to the algorithm?


Comments (1)

山田美奈子 2025-01-16 18:32:59


What k-means does is try to find centroids and assign each point to the centroid with the smallest Euclidean distance. When minimizing Euclidean distances or using them as loss functions in machine learning, one should in general make sure that the different features have the same scale. Otherwise, larger features would dominate when finding the closest points. That's why we normally do some scaling before training our models.

However, in your case, you can make use of that: first bring all features onto the same scale using MinMaxScaler or StandardScaler, and then either scale up the first 2 features by a factor > 1 or scale down the remaining 6 features by a factor < 1.
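A minimal sketch of this approach, assuming scikit-learn; the synthetic dataframe, the weight of 3.0, and the cluster count are illustrative choices, not values from the answer:

```python
# Scale all features to the same range, then multiply the two important
# features by a factor > 1 before running k-means, as suggested above.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for the dataframe shown in the question.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((10, 8)),
                  columns=[f"feature_{i}" for i in range(1, 9)])

# Step 1: bring every feature onto the [0, 1] scale.
X = MinMaxScaler().fit_transform(df)

# Step 2: boost feature_1 and feature_2 (the factor 3.0 is illustrative).
weights = np.ones(X.shape[1])
weights[:2] = 3.0
X_weighted = X * weights

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_weighted)
```

One subtlety worth noting: multiplying a feature by a factor w scales its contribution to the *squared* Euclidean distance by w², so a factor of 3 makes those two features weigh 9 times more in the k-means objective, not 3 times.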
