Sklearn treats differently ordered data as different sets

Posted 2025-02-12 16:14:10 | 884 characters | 1 view | 0 comments

So I have two dataframes, df1 and df2.

They are exactly the same, except that the rows are in a different order. For example:

df1:

 id   |  feat1   |  feat2   | 
 1    |    A     |    in    |
 2    |    B     |   out    |
 3    |    C     |   out    |

df2:

 id   |  feat1   |  feat2   | 
 3    |    C     |   out    |
 2    |    B     |   out    |
 1    |    A     |    in    |

I am building a clustering algorithm with KMeans on this data. Whenever I feed these differently ordered dataframes to pipeline.fit(), I end up with different results (different centroids).
Can anyone explain to me how this happens, and why sorting the data before fitting fixes it?

Inside my pipeline there is a preprocessing part, which consists of:

('selector', FunctionTransformer(_select_data)),
('transformer', FunctionTransformer(np.log1p,
                                    inverse_func=_invert_log_transform,
                                    check_inverse=False)),
('scaler', QuantileTransformer(random_state=42)),
('pca', PCA(random_state=42))

This is where I first notice the difference: the transformed data is already quite different between the two dataframes. If I sort the dataset before fitting, the output comes out the same.
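One way to narrow this down is to fit the preprocessing steps one at a time on both row orders and check where the outputs stop matching once the rows are realigned. The sketch below is only an illustration under stated assumptions: `first_diverging_step` and `make_steps` are hypothetical names, the data is synthetic, and the steps are a simplified stand-in for the pipeline above (no `_select_data`, no PCA). The `subsample=50` value is deliberately tiny to make one candidate cause visible: when the number of rows exceeds `subsample`, QuantileTransformer estimates its quantiles from a random subset of rows chosen by position, so the same `random_state` selects different rows once the order changes. Whether that is what happens in the real pipeline depends on the dataset size and the scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer, QuantileTransformer


def first_diverging_step(make_steps, X, perm, atol=1e-8):
    """Fit each step on X and on X[perm], one step at a time.

    Returns the name of the first step whose output differs after the
    permuted rows are put back in the original order, or None if all
    steps agree within atol.
    """
    inverse = np.argsort(perm)  # undoes the permutation: X[perm][inverse] == X
    a, b = X, X[perm]
    # Two independent sets of unfitted steps, one per row order.
    for (name, step_a), (_, step_b) in zip(make_steps(), make_steps()):
        a = step_a.fit_transform(a)
        b = step_b.fit_transform(b)
        if not np.allclose(a, b[inverse], atol=atol):
            return name
    return None


def make_steps():
    # Hypothetical, simplified stand-in for the preprocessing in the question.
    return [
        ("log", FunctionTransformer(np.log1p)),
        # Small subsample so that quantiles come from a random subset of rows.
        ("scaler", QuantileTransformer(n_quantiles=50, subsample=50,
                                       random_state=42)),
    ]


rng = np.random.default_rng(0)
X = rng.lognormal(size=(1000, 3))   # synthetic, strictly positive features
perm = rng.permutation(len(X))      # a shuffled row order, like df2 vs df1

print(first_diverging_step(make_steps, X, perm))
```

The same check can be pointed at the real steps by returning them from `make_steps`. If the first divergence turns out to be at the scaler, setting `subsample` to at least the number of rows is one thing to try. KMeans itself also picks its initial centers by row position, so a fixed `random_state` does not make it independent of row order either.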
