Custom sklearn Pipeline to transform both X and y
I created my own custom transformer for text processing. Inside the .transform()
method, I want to remove the corresponding target row if a document has no tokens left after preprocessing.
import spacy
from sklearn.base import BaseEstimator, TransformerMixin


class SpacyVectorizer(BaseEstimator, TransformerMixin):
    def __init__(
        self,
        alpha_only: bool = True,
        lemmatize: bool = True,
        remove_stopwords: bool = True,
        case_fold: bool = True,
    ):
        self.alpha_only = alpha_only
        self.lemmatize = lemmatize
        self.remove_stopwords = remove_stopwords
        self.case_fold = case_fold
        self.nlp = spacy.load(
            name='en_core_web_sm',
            disable=["parser", "ner"]
        )

    def fit(self, X, y=None):
        return self

    def transform(self, X, y):
        # Bag-of-Words matrix
        bow_matrix = []
        # Index labels of y, captured up front so positions stay valid after drops
        labels = list(y.index)
        # Iterate over documents in SpaCy pipeline
        for i, doc in enumerate(self.nlp.pipe(X)):
            # Words array
            words = []
            # Tokenize document
            for token in doc:
                # Remove non-alphabetic tokens
                if self.alpha_only and not token.is_alpha:
                    continue
                # Stopword removal
                if self.remove_stopwords and token.is_stop:
                    continue
                # Lemmatization
                if self.lemmatize:
                    token = token.lemma_
                # Case folding
                if self.case_fold:
                    token = str(token).casefold()
                # Append token (as a string) to words array
                words.append(str(token))
            # Update the BoW representation
            if words:
                # Preprocessed document
                new_doc = ' '.join(words)
                # Vector of preprocessed document (spaCy's Doc.vector)
                word_vec = self.nlp(new_doc).vector
                # Update the BoW matrix
                bow_matrix.append(word_vec)
            else:
                # Remove target label
                y.drop(labels[i], inplace=True)
        # Return BoW matrix
        return bow_matrix
Unfortunately, this does not work, because I cannot pass the y vector to the .transform() method.
How can I force the pipeline to pass both the X and y parameters?
Is there any other workaround for this?
I don't want to pass y via .fit_transform(), because the test data shouldn't be fitted.
Comments (1)
Here you have written y = None, which means that if you aren't passing any y value, it takes the default value None.
To force the pipeline to pass a y value, you should declare y without the default value.
If you do this, then you have to pass a y value; otherwise it will raise an error.
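A minimal sketch of the difference between the two signatures, using plain Python classes rather than scikit-learn itself. The names WithDefault, WithoutDefault, and call_like_pipeline are hypothetical and exist only for this illustration. The helper assumes a caller that passes only X to transform(), which is how scikit-learn's Pipeline calls its steps by default. Under that assumption, removing the default does not make the caller supply y; it only turns the missing argument into a TypeError.

```python
class WithDefault:
    """transform() declares y with a default, so y may be omitted."""

    def transform(self, X, y=None):
        return X, y


class WithoutDefault:
    """transform() declares y as required, so every caller must supply it."""

    def transform(self, X, y):
        return X, y


def call_like_pipeline(step, X):
    # Assumption: the caller passes only X to transform().
    return step.transform(X)


docs = ["some text"]

# With the default, the call succeeds and y is simply None.
result_default = call_like_pipeline(WithDefault(), docs)

# Without the default, the same call fails because y is missing.
try:
    call_like_pipeline(WithoutDefault(), docs)
    error_raised = False
except TypeError:
    error_raised = True

print(result_default)
print(error_raised)
```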
The error you are getting might also be caused by the spacing problem I am talking about: self might be receiving the X value, and the X parameter might be receiving the y value.
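One possible workaround for the original question, sketched under assumptions: do the row-dropping step outside the pipeline, so that X and y are filtered together before any transformer sees them. Here simple_tokens is a hypothetical stand-in for the spaCy preprocessing (with a tiny made-up stopword list), and drop_empty_rows is an illustrative helper, not part of scikit-learn.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "of"}  # tiny illustrative list, not spaCy's


def simple_tokens(text):
    """Stand-in for the spaCy step: alphabetic, case-folded, non-stopword tokens."""
    tokens = re.findall(r"[^\W\d_]+", text.casefold())
    return [t for t in tokens if t not in STOPWORDS]


def drop_empty_rows(X, y):
    """Keep only rows whose document still has tokens; filter X and y together."""
    keep = [bool(simple_tokens(doc)) for doc in X]
    X_kept = [doc for doc, k in zip(X, keep) if k]
    y_kept = [label for label, k in zip(y, keep) if k]
    return X_kept, y_kept


X = ["The cat sat", "12345 !!!", "the a of", "Dogs bark"]
y = [1, 0, 1, 0]

X_clean, y_clean = drop_empty_rows(X, y)
print(X_clean)
print(y_clean)
```

With this arrangement, transform() can keep the usual single-argument form and always return one row per input document, while the filtered X_clean and y_clean are what get passed to the pipeline. The same helper could be applied to the test split before scoring, which involves no fitting.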