Custom sklearn Pipeline to transform both X and y

Posted on 2025-01-11 12:47:51


I created my own custom transformer for text processing, to be used as a step in a sklearn Pipeline. Inside the .transform() method, I want to remove the corresponding target row if a document has no tokens left after preprocessing.

import spacy
from sklearn.base import BaseEstimator, TransformerMixin


class SpacyVectorizer(BaseEstimator, TransformerMixin):
  def __init__(
      self,
      alpha_only: bool = True,
      lemmatize: bool = True,
      remove_stopwords: bool = True,
      case_fold: bool = True,
    ):
    self.alpha_only = alpha_only
    self.lemmatize = lemmatize
    self.remove_stopwords = remove_stopwords
    self.case_fold = case_fold
    self.nlp = spacy.load(
      name='en_core_web_sm',
      disable=["parser", "ner"]
    )

  def fit(self, X, y=None):
    return self

  def transform(self, X, y):
    # Matrix of document vectors
    bow_matrix = []

    # Positions of documents that end up with no tokens
    empty_rows = []

    # Iterate over documents in SpaCy pipeline
    for i, doc in enumerate(self.nlp.pipe(X)):
      # Words array
      words = []

      # Tokenize document
      for token in doc:

        # Remove non-alphabetic tokens
        if self.alpha_only and not token.is_alpha:
          continue

        # Stopword removal
        if self.remove_stopwords and token.is_stop:
          continue

        # Lemmatization (fall back to the raw token text)
        word = token.lemma_ if self.lemmatize else token.text

        # Case folding
        if self.case_fold:
          word = word.casefold()

        # Append token to words array
        words.append(word)

      if words:
        # Preprocessed document
        new_doc = ' '.join(words)

        # Vector of the preprocessed document (Doc.vector is not L2-normalized)
        bow_matrix.append(self.nlp(new_doc).vector)

      else:
        # Remember the row so its target label can be removed
        empty_rows.append(i)

    # Remove the target labels of empty documents in one step, after the loop,
    # so that earlier drops do not shift the positions of later rows
    y.drop(y.index[empty_rows], inplace=True)

    # Return matrix of document vectors
    return bow_matrix

Unfortunately, because I cannot pass the y vector to the .transform() method, it does not work.

How can I force the pipeline to pass both the X and y parameters?
Is there any other workaround?
I don't want to pass y via .fit_transform(), because test data shouldn't be fitted.


Answer by 说好的呢, 2025-01-18 12:47:51

def transform(self, X, y=None):

Here y=None means that if the caller does not pass a y value, y takes the default value None.

To make y a required argument, you would write

def transform(self, X, y):
     pass

If you do this, the caller has to pass a y value; otherwise Python raises a TypeError. Note that sklearn's Pipeline calls each step's transform() with X only, so making y required does not make the pipeline supply it.
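To illustrate what a required y means when something else calls transform(), here is a minimal sketch. MiniPipeline, OptionalY and RequiredY are hypothetical names, not part of sklearn; MiniPipeline only mimics the relevant behaviour of Pipeline.transform, which hands each step the features X and never the target y.

```python
# Hypothetical stand-in for sklearn's Pipeline (an assumption for illustration):
# like Pipeline.transform, it passes only X to each step, never y.
class MiniPipeline:
    def __init__(self, steps):
        self.steps = steps

    def transform(self, X):
        for step in self.steps:
            X = step.transform(X)  # y is not forwarded
        return X


class OptionalY:
    def transform(self, X, y=None):
        # y falls back to None when the caller omits it
        return [x.lower() for x in X]


class RequiredY:
    def transform(self, X, y):
        # y is required, so a call without it fails before this line runs
        return [x.lower() for x in X]


docs = ["Hello World", "FOO"]

# Works: the missing y defaults to None
ok = MiniPipeline([OptionalY()]).transform(docs)

# Fails: the stand-in pipeline never supplies y
try:
    MiniPipeline([RequiredY()]).transform(docs)
    error = None
except TypeError as exc:
    error = str(exc)

print(ok)
print(error)
```

The second call ends in a TypeError about the missing argument y, so changing the signature alone does not get y into transform().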

The indentation problem I am talking about:

import spacy


class SpacyVectorizer:
    def __init__(
        self,
        alpha_only: bool = True,
        lemmatize: bool = True,
        remove_stopwords: bool = True,
        case_fold: bool = True,
    ):
        self.alpha_only = alpha_only
        self.lemmatize = lemmatize
        self.remove_stopwords = remove_stopwords
        self.case_fold = case_fold
        self.nlp = spacy.load(
            name='en_core_web_sm',
            disable=["parser", "ner"]
        )

    def transform(self, X, y):
        # Matrix of document vectors
        bow_matrix = []

        # Positions of documents that end up with no tokens
        empty_rows = []

        # Iterate over documents in SpaCy pipeline
        for i, doc in enumerate(self.nlp.pipe(X)):
            # Words array
            words = []

            # Tokenize document
            for token in doc:

                # Remove non-alphabetic tokens
                if self.alpha_only and not token.is_alpha:
                    continue

                # Stopword removal
                if self.remove_stopwords and token.is_stop:
                    continue

                # Lemmatization (fall back to the raw token text)
                word = token.lemma_ if self.lemmatize else token.text

                # Case folding
                if self.case_fold:
                    word = word.casefold()

                # Append token to words array
                words.append(word)

            if words:
                # Preprocessed document
                new_doc = ' '.join(words)

                # Vector of the preprocessed document (not L2-normalized)
                bow_matrix.append(self.nlp(new_doc).vector)

            else:
                # Remember the row so its target label can be removed
                empty_rows.append(i)

        # Remove the target labels of empty documents after the loop,
        # so that earlier drops do not shift later positions
        y.drop(y.index[empty_rows], inplace=True)

        # Return matrix of document vectors
        return bow_matrix

The error you are getting might be caused by this indentation problem: self might be receiving the X value, and the X parameter might be receiving the y value.
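A possible workaround, which is not part of the answer above: sklearn transformers are expected to return one output row per input row, and Pipeline does not route y through transform(), so rows can instead be filtered from X and y together before the data reaches the pipeline. The sketch below assumes plain Python lists; tokenize and drop_empty_documents are hypothetical names, and the regex tokenizer is only a stand-in for the spaCy preprocessing in the question.

```python
import re


def tokenize(text):
    """Hypothetical stand-in for the spaCy step: alphabetic tokens, case-folded."""
    return [t.casefold() for t in re.findall(r"[^\W\d_]+", text)]


def drop_empty_documents(texts, labels):
    """Keep only the (text, label) pairs whose text yields at least one token."""
    kept = [(t, l) for t, l in zip(texts, labels) if tokenize(t)]
    kept_texts = [t for t, _ in kept]
    kept_labels = [l for _, l in kept]
    return kept_texts, kept_labels


X_train = ["Good movie", "123 !!!", "Terrible plot"]
y_train = [1, 0, 0]

# The second document has no alphabetic tokens, so it is removed with its label
X_clean, y_clean = drop_empty_documents(X_train, y_train)

print(X_clean)
print(y_clean)
```

Because the filter has no fitted state, the same function can be applied to test data without fitting anything on it. The cleaned X and y stay aligned and can then be passed to the pipeline's fit() or predict() as usual.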
