How do I add feature importance coefficients before k-means?

Published 2025-01-09 18:32:59


Let's say I have the given dataframe:

   feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  feature_7  feature_8
0   0.862874   0.392938   0.669744   0.939903   0.382574   0.780595   0.049201   0.627703
1   0.942322   0.676181   0.223476   0.102698   0.620883   0.834038   0.966355   0.554645
2   0.940375   0.310532   0.975096   0.600778   0.893220   0.282508   0.837575   0.112575
3   0.868902   0.818175   0.102860   0.936395   0.406088   0.619990   0.913905   0.597607
4   0.143344   0.207751   0.835707   0.414900   0.360534   0.525631   0.228751   0.294437
5   0.339856   0.501197   0.671033   0.302202   0.406512   0.997044   0.080621   0.068071
6   0.521056   0.343654   0.812553   0.393159   0.217987   0.247602   0.671783   0.254299
7   0.594744   0.180041   0.884603   0.578050   0.441461   0.176732   0.569595   0.391923
8   0.402864   0.062175   0.565858   0.349415   0.106725   0.323310   0.153594   0.277930
9   0.480539   0.540283   0.248376   0.252237   0.229181   0.092273   0.546501   0.201396

And I would like to find clusters in these rows. To do so, I want to use k-means. However, I would like to find the clusters by giving more importance to [feature_1, feature_2] than to the other features in the dataframe.
Let's say an importance coefficient of 0.5 for [feature_1, feature_2], and 0.5 for the remaining features.

I thought about transforming [feature_3, ..., feature_8] into a single column using PCA. By doing so, I imagine that k-means would give less importance to a single feature than to 6 separate features.
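The PCA idea above could be sketched roughly as follows, assuming scikit-learn is available; the synthetic dataframe, the choice of `n_clusters=2`, and all variable names are illustrative, not part of the original question:

```python
# Sketch of the idea described above: compress feature_3..feature_8 into
# a single principal component so those six features enter k-means as
# one column, while feature_1 and feature_2 stay as separate columns.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for the dataframe shown in the question.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 8)),
                  columns=[f"feature_{i}" for i in range(1, 9)])

rest = [f"feature_{i}" for i in range(3, 9)]
# Reduce the six less-important features to one column.
compressed = PCA(n_components=1).fit_transform(df[rest])

# Cluster on [feature_1, feature_2, pca_component].
X = np.column_stack([df[["feature_1", "feature_2"]].to_numpy(), compressed])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Note that this does not give a precise 0.5/0.5 split: the PCA component's variance depends on the data, so its influence on the Euclidean distance is data-dependent rather than fixed.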

Is this a good idea? Do you see better ways of giving this information to the algorithm?


Comments (1)

山田美奈子 2025-01-16 18:32:59


What k-means does is try to find centroids and assign each point to the centroid with the smallest Euclidean distance. When minimizing Euclidean distances or using them as loss functions in machine learning, one should in general make sure that the different features have the same scale. Otherwise, larger features would dominate when finding the closest points. That's why we normally do some scaling before training our models.

However, in your case, you can make use of that: first bring all features onto the same scale using MinMaxScaler or StandardScaler, and then either scale up the first 2 features by a factor > 1 or scale down the remaining 6 features by a factor < 1.
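A minimal sketch of this approach, assuming scikit-learn; the synthetic dataframe, the weight of 3.0, and the cluster count are illustrative choices, not values from the answer:

```python
# Scale all features to the same range, then multiply the two important
# features by a factor > 1 before running k-means, as suggested above.
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans

# Synthetic stand-in for the dataframe shown in the question.
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((10, 8)),
                  columns=[f"feature_{i}" for i in range(1, 9)])

# Step 1: bring every feature onto the [0, 1] scale.
X = MinMaxScaler().fit_transform(df)

# Step 2: boost feature_1 and feature_2 (the factor 3.0 is illustrative).
weights = np.ones(X.shape[1])
weights[:2] = 3.0
X_weighted = X * weights

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_weighted)
```

One subtlety worth noting: multiplying a feature by a factor w scales its contribution to the *squared* Euclidean distance by w², so a factor of 3 makes those two features weigh 9 times more in the k-means objective, not 3 times.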
