Many FunctionTransformers on the same column - sklearn

Posted on 01-19 06:55

I have only one input, the user's email address, and I have written several functions that derive features from it, which I apply with scikit-learn's FunctionTransformer. For example:

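import numpy as np
from lightgbm import LGBMClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

X = np.array(['[email protected]', '[email protected]'])
y = np.array([True, False])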

def email_length(email) -> list:
    # Length of the local part (everything before the '@').
    return [len(e.split('@')[0]) for e in email]

def domain_length(email) -> list:
    # Length of the domain (everything after the last '@').
    return [len(e.split('@')[-1]) for e in email]

def number_of_vouls(email) -> list:
    # Number of vowels in the local part of each address.
    vouls = 'aeiouAEIOU'
    names = [e.split('@')[0] for e in email]
    return [sum(1 for char in name if char in vouls) for name in names]

After creating the functions, I wrap each one in a FunctionTransformer:

email_length1 = FunctionTransformer(email_length)
domain_length1 = FunctionTransformer(domain_length)
number_of_vouls1 = FunctionTransformer(number_of_vouls)

Then I create the Pipeline:

pipe = Pipeline([
        ('email_length', email_length1),
        ('domain_length', domain_length1),
        ('number_of_vouls', number_of_vouls1),
        ('classifier', LGBMClassifier())
        ])

But when I try to fit the model with

 pipe.fit(X, y)

I get AttributeError: 'int' object has no attribute 'split', even though calling one of the functions directly works:

domain_length(X)
Output : [9, 9]

Comments (1)

梦冥, 2025-01-26 06:55:17

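Pipeline steps are applied sequentially: each step receives the output of the previous one. Your second transformer is therefore receiving the email lengths (integers) rather than the email addresses, which is why calling .split on them raises the AttributeError.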
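To make the chaining concrete, here is a minimal sketch in plain Python that reproduces what the Pipeline does with the first two steps. The sample addresses are hypothetical placeholders, not the ones from the question, and the functions are simplified copies of the asker's.

```python
def email_length(emails):
    # Length of the local part (before the '@').
    return [len(e.split('@')[0]) for e in emails]

def domain_length(emails):
    # Length of the domain (after the last '@').
    return [len(e.split('@')[-1]) for e in emails]

# Hypothetical sample addresses, used only for illustration.
emails = ['alice@example.com', 'bob@example.org']

# A Pipeline passes each step the previous step's output,
# so the second function sees integers, not strings.
step1 = email_length(emails)
try:
    domain_length(step1)
    error = None
except AttributeError as exc:
    error = str(exc)

print(step1)
print(error)
```

Calling domain_length(emails) directly works, which matches the behaviour described in the question: the functions themselves are fine, only the chaining is wrong.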

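You can use a ColumnTransformer or FeatureUnion here; both apply every transformer to the original input (or to selected columns of it) and concatenate the outputs side by side. For example,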

preproc = FeatureUnion([
        ('email_length', email_length1),
        ('domain_length', domain_length1),
        ('number_of_vouls', number_of_vouls1),
])

pipe = Pipeline([
        ('preproc', preproc),
        ('classifier', LGBMClassifier())
        ])

You'll then get a new error, because your functions return flat 1-D lists rather than 2-D column arrays that can be stacked as features. Wrapping each result in a NumPy array and reshaping it to a single column appears to work:

def email_length(email) -> np.ndarray:
    # Return one feature column of shape (n_samples, 1).
    return np.array([len(e.split('@')[0]) for e in email]).reshape(-1, 1)
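Putting the pieces together, the following is a sketch of the full FeatureUnion approach with the same reshaping applied to all three functions. Several things here are assumptions rather than part of the original post: the sample addresses and labels are invented, as_column is a hypothetical helper, and LogisticRegression stands in for LGBMClassifier so the example does not depend on lightgbm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

VOWELS = set('aeiouAEIOU')

def as_column(values) -> np.ndarray:
    # Hypothetical helper: turn a flat list into an (n_samples, 1) array.
    return np.asarray(values).reshape(-1, 1)

def email_length(emails) -> np.ndarray:
    return as_column([len(e.split('@')[0]) for e in emails])

def domain_length(emails) -> np.ndarray:
    return as_column([len(e.split('@')[-1]) for e in emails])

def number_of_vouls(emails) -> np.ndarray:
    return as_column([sum(ch in VOWELS for ch in e.split('@')[0]) for e in emails])

# Each transformer sees the raw addresses; outputs are stacked column-wise.
preproc = FeatureUnion([
    ('email_length', FunctionTransformer(email_length)),
    ('domain_length', FunctionTransformer(domain_length)),
    ('number_of_vouls', FunctionTransformer(number_of_vouls)),
])

# Invented sample data, used only for illustration.
X = np.array(['alice@example.com', 'bob@mail.example.org',
              'carol@example.net', 'dave@corp.example.com'])
y = np.array([True, False, True, False])

features = preproc.fit_transform(X)

pipe = Pipeline([
    ('preproc', preproc),
    ('classifier', LogisticRegression()),  # stand-in for LGBMClassifier
])
pipe.fit(X, y)
predictions = pipe.predict(X)
```

Here features has one row per address and one column per function, so the classifier receives an ordinary 2-D feature matrix. Swapping LGBMClassifier back in should only require changing the final step.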