Custom sklearn Pipeline to transform both X and y

Posted on 2025-01-11 12:47:51

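I created my own custom pipeline step for text processing. Inside the .transform() method, I want to remove the corresponding target row whenever a document yields no tokens.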

import spacy
from sklearn.base import BaseEstimator, TransformerMixin


class SpacyVectorizer(BaseEstimator, TransformerMixin):
  def __init__(
      self,
      alpha_only: bool = True,
      lemmatize: bool = True,
      remove_stopwords: bool = True,
      case_fold: bool = True,
    ):
    self.alpha_only = alpha_only
    self.lemmatize = lemmatize
    self.remove_stopwords = remove_stopwords
    self.case_fold = case_fold
    self.nlp = spacy.load(
      name='en_core_web_sm',
      disable=["parser", "ner"]
    )

  def fit(self, X, y=None):
    return self

  def transform(self, X, y):
    # Matrix of document vectors (one row per non-empty document)
    bow_matrix = []

    # Positions of documents that end up with no tokens
    empty_rows = []

    # Iterate over documents in SpaCy pipeline
    for i, doc in enumerate(self.nlp.pipe(X)):
      # Words array
      words = []

      # Tokenize document
      for token in doc:

        # Remove non-alphabetic tokens
        if self.alpha_only and not token.is_alpha:
          continue

        # Stopword removal
        if self.remove_stopwords and token.is_stop:
          continue

        # Lemmatization
        if self.lemmatize:
          token = token.lemma_

        # Case folding
        if self.case_fold:
          token = str(token).casefold()

        # Append token (as a string) to words array
        words.append(str(token))

      if words:
        # Preprocessed document
        new_doc = ' '.join(words)

        # Document vector of the preprocessed document
        # (Doc.vector is an average of token vectors, not L2-normalized)
        bow_matrix.append(self.nlp(new_doc).vector)

      else:
        # Remember the position so the target label can be removed
        empty_rows.append(i)

    # Remove the target labels of empty documents in one step,
    # so that earlier drops do not shift later positions
    y.drop(y.index[empty_rows], inplace=True)

    # Return the matrix of document vectors
    return bow_matrix

Unfortunately, this does not work, because I cannot pass the y vector to the .transform() method.

How can I force the pipeline to pass both X and y? Is there any other workaround? I don't want to pass y via .fit_transform(), because the test data shouldn't be fitted.

Comments (1)

说好的呢 2025-01-18 12:47:51
def transform(self, X, y=None):

Here you have written y=None, which means that if you don't pass a y value, it falls back to the default value None.

To force a pipeline to pass a y value, you should write:

def transform(self, X, y):
     pass

If you do this, you have to pass a y value; otherwise it will raise an error.
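To illustrate that error: as far as I know, scikit-learn's Pipeline calls each intermediate step's transform with X only, so a transform that requires y fails with a TypeError instead of receiving y. A minimal sketch, where NeedsY is a hypothetical stand-in for SpacyVectorizer and DummyClassifier is only a placeholder final step:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import Pipeline


class NeedsY(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for SpacyVectorizer: transform() requires y."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y):  # y has no default value
        return X


X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 1, 0, 1])

pipe = Pipeline([("vec", NeedsY()), ("clf", DummyClassifier())])

try:
    pipe.fit(X, y)
    error = None
except TypeError as exc:
    # The step's transform() was called with X only, so y is missing
    error = exc

print(type(error).__name__)
```

So making y a required argument does not make the pipeline supply it; it only turns the missing argument into an exception.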

The indentation problem I am referring to:

import spacy


class SpacyVectorizer:
    def __init__(
        self,
        alpha_only: bool = True,
        lemmatize: bool = True,
        remove_stopwords: bool = True,
        case_fold: bool = True,
    ):
        self.alpha_only = alpha_only
        self.lemmatize = lemmatize
        self.remove_stopwords = remove_stopwords
        self.case_fold = case_fold
        self.nlp = spacy.load(
            name='en_core_web_sm',
            disable=["parser", "ner"]
        )

    def transform(self, X, y):
        # Matrix of document vectors (one row per non-empty document)
        bow_matrix = []

        # Positions of documents that end up with no tokens
        empty_rows = []

        # Iterate over documents in SpaCy pipeline
        for i, doc in enumerate(self.nlp.pipe(X)):
            # Words array
            words = []

            # Tokenize document
            for token in doc:

                # Remove non-alphabetic tokens
                if self.alpha_only and not token.is_alpha:
                    continue

                # Stopword removal
                if self.remove_stopwords and token.is_stop:
                    continue

                # Lemmatization
                if self.lemmatize:
                    token = token.lemma_

                # Case folding
                if self.case_fold:
                    token = str(token).casefold()

                # Append token (as a string) to words array
                words.append(str(token))

            if words:
                # Preprocessed document
                new_doc = ' '.join(words)

                # Document vector of the preprocessed document
                bow_matrix.append(self.nlp(new_doc).vector)

            else:
                # Remember the position so the target label can be removed
                empty_rows.append(i)

        # Remove the target labels of empty documents in one step
        y.drop(y.index[empty_rows], inplace=True)

        # Return the matrix of document vectors
        return bow_matrix

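The error you are getting might be caused by the indentation problem: self might be receiving the X value, and the X parameter might be receiving the y value.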
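One possible workaround, sketched under assumptions: since nothing has to be fitted in order to decide whether a document is empty, the rows can be filtered from X and y together before the data reaches the pipeline, and transform can keep the usual transform(X) signature. The names drop_empty_docs and simple_tokens below are hypothetical, and simple_tokens is only a plain-Python stand-in for the spaCy token filtering in the question:

```python
from typing import Callable, List, Sequence, Tuple


def simple_tokens(text: str) -> List[str]:
    """Stand-in tokenizer: whitespace split, keep alphabetic words, case-fold."""
    return [word.casefold() for word in text.split() if word.isalpha()]


def drop_empty_docs(
    texts: Sequence[str],
    labels: Sequence[int],
    tokenize: Callable[[str], List[str]] = simple_tokens,
) -> Tuple[List[str], List[int]]:
    """Keep only the (text, label) pairs whose text yields at least one token."""
    kept_texts: List[str] = []
    kept_labels: List[int] = []
    for text, label in zip(texts, labels):
        if tokenize(text):
            kept_texts.append(text)
            kept_labels.append(label)
    return kept_texts, kept_labels


texts = ["Hello world", "123 456", "!!!", "spaCy pipelines"]
labels = [1, 0, 1, 0]

X_kept, y_kept = drop_empty_docs(texts, labels)
print(X_kept)  # ['Hello world', 'spaCy pipelines']
print(y_kept)  # [1, 0]
```

The filtered X_kept and y_kept could then be passed to pipeline.fit(X_kept, y_kept). The same filter can be applied to the test texts and labels before predict or score; it is a stateless check, so the test data is never fitted.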
