Sklearn treats differently ordered data as different sets
I have two DataFrames, df1 and df2.
They contain exactly the same rows; only the row order differs between the two. For example:
df1:
id | feat1 | feat2 |
1 | A | in |
2 | B | out |
3 | C | out |
df2:
id | feat1 | feat2 |
3 | C | out |
2 | B | out |
1 | A | in |
I am building a clustering model with KMeans on this data. Whenever I feed these differently ordered DataFrames to pipeline.fit(), I end up with different results (different centroids).
Can anyone explain how this happens, and why sorting the data before calling fit fixes it?
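One thing that may be worth checking is how k-means picks its starting centers. The sketch below is only an illustration on synthetic data, not the real features: `X`, `X_shuffled`, and `sort_rows` are made-up names. It uses scikit-learn's `kmeans_plusplus` helper to show that a fixed `random_state` reproduces the random draws, which select rows by position, so the same seed can pick different points once the rows are reordered. Whether this is what drives the different centroids here is an assumption.

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

# Synthetic stand-in for the real data (assumption: any numeric matrix works).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X_shuffled = X[rng.permutation(len(X))]  # same rows, different order

# Same seed for both calls; only the row order differs.
centers_a, idx_a = kmeans_plusplus(X, n_clusters=3, random_state=42)
centers_b, idx_b = kmeans_plusplus(X_shuffled, n_clusters=3, random_state=42)


def sort_rows(a):
    """Hypothetical helper: order centers by first coordinate for comparison."""
    return a[np.argsort(a[:, 0])]


# The first seed is drawn by row position, so the position matches...
same_first_position = bool(idx_a[0] == idx_b[0])
# ...but the point stored at that position differs after shuffling.
same_seed_points = bool(np.allclose(sort_rows(centers_a), sort_rows(centers_b)))

print("same first position:", same_first_position)
print("same seed points:", same_seed_points)
```

Different starting centers do not always lead to different final centroids, since KMeans may still converge to the same solution, but on data without clear cluster structure they often do.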
Inside my pipeline there is a preprocessing part, which consists of:
('selector', FunctionTransformer(_select_data)),
('transformer', FunctionTransformer(np.log1p,
                                    inverse_func=_invert_log_transform,
                                    check_inverse=False)),
('scaler', QuantileTransformer(random_state=42)),
('pca', PCA(random_state=42))
This is where I start noticing the difference: the transformed data already differs noticeably between the two orderings. If I sort the dataset before fitting, the output comes out the same.
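A possible source of the preprocessing difference, sketched under assumptions: `QuantileTransformer` fits its quantiles on a random subsample of rows whenever the number of samples exceeds its `subsample` parameter, and with a fixed `random_state` that subsample is chosen by row position. The size of the real dataset is not stated, so whether this applies here is a guess. The data below is synthetic and `fitted_quantiles` is a hypothetical helper.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic stand-in for the log-transformed features (assumption).
rng = np.random.default_rng(0)
X = rng.lognormal(size=(20_000, 2))
X_shuffled = X[rng.permutation(len(X))]  # same rows, different order


def fitted_quantiles(data, subsample):
    """Fit a QuantileTransformer and return the quantiles it learned."""
    qt = QuantileTransformer(n_quantiles=100, subsample=subsample,
                             random_state=42)
    return qt.fit(data).quantiles_


# subsample == n_samples: every row is used, so row order does not matter.
same_when_full = bool(np.allclose(fitted_quantiles(X, 20_000),
                                  fitted_quantiles(X_shuffled, 20_000)))

# subsample < n_samples: the same row positions are drawn from each array,
# but those positions hold different rows after shuffling.
same_when_subsampled = bool(np.allclose(fitted_quantiles(X, 1_000),
                                        fitted_quantiles(X_shuffled, 1_000)))

print("full data, same quantiles:", same_when_full)
print("subsampled, same quantiles:", same_when_subsampled)
```

If this turns out to be the cause, sorting the rows before fitting would make the subsample reproducible, which would match the observation that the sorted dataset gives identical output.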