Extracting feature names from a sklearn ColumnTransformer

Asked on 2025-01-18 13:35:24

I'm using sklearn.pipeline to transform my features and fit a model, so my general flow looks like this: column transformer --> general pipeline --> model. I would like to extract the feature names from the column transformer (the following step, the general pipeline, applies the same transformation to all columns, e.g. nan_to_zero) and use them for model explainability (e.g. feature importance). I'd also like this to work with custom transformer classes.

Here is the setup:

import numpy as np
import pandas as pd
from sklearn import compose, pipeline, preprocessing

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": ["x", "y", "z"]})
column_transformer = compose.make_column_transformer(
    (preprocessing.StandardScaler(), ["a", "b"]),
    (preprocessing.KBinsDiscretizer(n_bins=2, encode="ordinal"), ["a"]),
    (preprocessing.OneHotEncoder(), ["c"]),
)
pipe = pipeline.Pipeline([
    ("transform", column_transformer),
    ("nan_to_num", preprocessing.FunctionTransformer(np.nan_to_num, validate=False))
])
pipe.fit_transform(df)  # returns a numpy array

So far I've tried using get_feature_names_out, e.g.:

pipe.named_steps["transform"].get_feature_names_out()

But I'm running into get_feature_names_out() takes 1 positional argument but 2 were given. I'm not sure what's going on, and this entire process doesn't feel right. Is there a better way to do it?

EDIT: A big thank you to @amiola for answering the question; that was indeed the problem. I just want to add another important point for posterity: I was having other problems with my own custom pipeline and was getting the error get_feature_names_out() takes 1 positional argument but 2 were given. It turns out that, aside from the KBinsDiscretizer issue, there was another bug in my custom transformer classes: I had implemented the get_feature_names_out method, but it did not accept any parameter on my end, and that was the problem. If you run into similar issues, make sure that this method has the following signature: get_feature_names_out(self, input_features) -> List[str].
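For anyone who wants a concrete reference, here is a minimal sketch of a custom transformer with a correctly-signed get_feature_names_out. The ClipTransformer class, its parameters, and its clipping behaviour are illustrative assumptions, not part of the original question:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ClipTransformer(BaseEstimator, TransformerMixin):
    """Clips values to a range; exists only to demonstrate the method signature."""

    def __init__(self, low=0.0, high=1.0):
        self.low = low
        self.high = high

    def fit(self, X, y=None):
        # Nothing to learn; just remember the input column names if available.
        self.feature_names_in_ = getattr(X, "columns", None)
        return self

    def transform(self, X):
        return np.clip(X, self.low, self.high)

    def get_feature_names_out(self, input_features=None):
        # ColumnTransformer passes the input column names as the second
        # positional argument; omitting this parameter is what produces
        # "get_feature_names_out() takes 1 positional argument but 2 were given".
        if input_features is None:
            input_features = self.feature_names_in_
        return np.asarray(input_features, dtype=object)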

1 Answer

听闻余生 2025-01-25 13:35:24

It seems the problem is caused by the encode="ordinal" parameter passed to the KBinsDiscretizer constructor. The bug is tracked in GitHub issue #22731 and GitHub issue #22841 and was solved by PR #22735.

Indeed, you might see that by specifying encode="onehot" you get a consistent result:

import numpy as np
import pandas as pd
from sklearn import compose, pipeline, preprocessing

df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": ["x", "y", "z"]})
column_transformer = compose.make_column_transformer(
    (preprocessing.StandardScaler(), ["a", "b"]),
    (preprocessing.KBinsDiscretizer(n_bins=2, encode="onehot"), ["a"]),
    (preprocessing.OneHotEncoder(), ["c"]),
)
pipe = pipeline.Pipeline([
    ("transform", column_transformer),
    ("nan_to_num", preprocessing.FunctionTransformer(np.nan_to_num, validate=False))
])
pipe.fit_transform(df)

pipe.named_steps['transform'].get_feature_names_out()

# array(['standardscaler__a', 'standardscaler__b', 'kbinsdiscretizer__a_0.0',
#        'kbinsdiscretizer__a_1.0', 'onehotencoder__c_x', 'onehotencoder__c_y',
#        'onehotencoder__c_z'], dtype=object)
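As an aside, once the feature names are recovered like this, the model-explainability use case from the question might look like the following sketch. The RandomForestRegressor and the made-up target y are illustrative assumptions; pipe and df are reused from the snippet above:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = pipe.fit_transform(df)  # pipeline and df from the snippet above
names = pipe.named_steps["transform"].get_feature_names_out()

y = [10.0, 20.0, 30.0]      # made-up target, purely for illustration
model = RandomForestRegressor(random_state=0).fit(X, y)

# Pair each transformed feature name with its importance for inspection.
importances = pd.Series(model.feature_importances_, index=names)
print(importances.sort_values(ascending=False))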

Besides this, everything seems fine to me.

In the end, apparently, even after installing the nightly builds, I still get the same error.
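If you are stuck on a version where encode="ordinal" still triggers the error, one hedged workaround is to build the output names yourself from the fitted ColumnTransformer, falling back to the input column names for any transformer whose get_feature_names_out is missing or rejects the input_features argument. collect_feature_names below is a hypothetical helper, not part of scikit-learn:

def collect_feature_names(fitted_column_transformer):
    # Walk the fitted transformers and mirror ColumnTransformer's
    # "name__feature" naming convention.
    names = []
    for name, trans, cols in fitted_column_transformer.transformers_:
        if trans == "drop":
            continue
        if trans == "passthrough":
            names.extend(f"{name}__{c}" for c in cols)
            continue
        try:
            out = trans.get_feature_names_out(cols)
        except (AttributeError, TypeError):
            # e.g. a transformer without the method, or one whose
            # get_feature_names_out does not accept input_features
            out = cols
        names.extend(f"{name}__{f}" for f in out)
    return names

collect_feature_names(pipe.named_steps["transform"])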
