How to access a DataFrame column as text during object instantiation in Python

Posted on 2025-01-22 12:13:24

I am trying to create a class to pre-process a text dataset. After creating an instance of my class, I want to call some of its methods on a column of the data frame, but it does not work. This is what I tried:

import re
from bs4 import BeautifulSoup

class Preprocessor:
    def __init__(self, dataset):
        self.dataset = dataset

    def strip_html(self, text):
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    def remove_between_square_brackets(self, text):
        return re.sub(r'\[[^]]*\]', '', text)

    def denoise_text(self, text):
        text = self.strip_html(text)
        text = self.remove_between_square_brackets(text)
        return text

I call the methods like this:

trial = Preprocessor(dataset['review'])
trial.strip_html(dataset['review'])

and I get this error message:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-26f1c4298563> in <module>()
----> 1 trial.strip_html(dataset['review'])

6 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/generic.py in __nonzero__(self)
   1536     def __nonzero__(self):
   1537         raise ValueError(
-> 1538             f"The truth value of a {type(self).__name__} is ambiguous. "
   1539             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1540         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

贵在坚持 2025-01-29 12:13:24

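BeautifulSoup expects a string as its markup input, so it cannot be applied directly to a pandas Series. When a Series is passed in, it ends up being evaluated as a single boolean, which is what raises the ValueError in your traceback.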
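To see where that message comes from, here is a minimal sketch that triggers the same error without BeautifulSoup at all. The sample data and the helper name `truth_test_fails` are made up for illustration; the only assumption is that some code, somewhere, asks whether a whole Series is "true".

```python
import pandas as pd

# Hypothetical sample data standing in for dataset['review'].
reviews = pd.Series(["<b>good</b>", "<i>excellent</i>"])

# Comparing a Series with a string is element-wise: the result is
# another Series of booleans, not a single True/False.
mask = reviews == ""


def truth_test_fails(series):
    """Return the error message raised when a Series is used as a boolean."""
    try:
        bool(series)
    except ValueError as exc:
        return str(exc)
    return None


message = truth_test_fails(reviews)
print(message)
```

So the parser has to be given one string at a time rather than the whole column.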

One way to achieve this would be to iterate over each element in the series and apply the method to it:

import pandas as pd
from bs4 import BeautifulSoup


class Preprocessor:
    def __init__(self, dataset):
        # dataset is a pandas Series of HTML strings
        self.dataset = dataset

    @staticmethod
    def soup_and_strip(text):
        # Works on a single string
        soup = BeautifulSoup(text, "html.parser")
        return soup.get_text()

    def strip_html(self):
        # Apply soup_and_strip to each element of the stored Series
        return self.dataset.apply(self.soup_and_strip)


if __name__ == '__main__':
    df = pd.DataFrame(
        {'review': ['<b>good</b>', '<i>excellent</i>', '<h1>splendid</h1>']})
    trial = Preprocessor(df['review'])
    print(trial.strip_html())

Remark:
Your overall idea of a preprocessor is good, but the implementation is a bit odd. You initialise the Preprocessor with the required data, but instead of using that data directly in its methods, you pass the data in again as an argument. You might want to look up some tutorials on class usage.

Another piece of advice: name your arguments appropriately. Calling an argument "text" but passing a pandas Series is confusing (and, in this case, the source of the problem you describe).
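Putting both remarks together, the full denoising step from the question could be sketched roughly as follows. This is only one possible layout, not the definitive one: the attribute name `reviews`, the underscore-prefixed helper names and the sample data are my own choices for illustration, while the regular expression and the two cleaning steps are taken from the question.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup


class Preprocessor:
    def __init__(self, reviews):
        # `reviews` (hypothetical name) is expected to be a pandas Series of strings.
        self.reviews = reviews

    @staticmethod
    def _strip_html(text):
        # Operates on a single string.
        return BeautifulSoup(text, "html.parser").get_text()

    @staticmethod
    def _remove_between_square_brackets(text):
        # Operates on a single string; raw string avoids invalid escape warnings.
        return re.sub(r"\[[^]]*\]", "", text)

    def denoise_text(self):
        # Run both single-string helpers over every element of the stored Series.
        cleaned = self.reviews.apply(self._strip_html)
        return cleaned.apply(self._remove_between_square_brackets)


# Hypothetical sample data.
sample = pd.Series(["<b>good</b> [spoiler]", "<i>excellent</i>"])
result = Preprocessor(sample).denoise_text()
print(result.tolist())
```

The helpers take a single string and are named accordingly, and the public method takes no argument because it works on the data given at construction time.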
