How to solve TypeError: iteration over a 0-d array and TypeError: Cannot use a string pattern on a bytes-like object
I am trying to apply preprocessing steps to my data. I have 6 functions to preprocess the data, and I call them from a preprocess function. It works when I try these functions one by one on the example sentence.
data = "AN example 1 Sentence !!"
def preprocess(data):
data = convert_lower_case(data)
# data = convert_number(data)
# data = remove_punctuation(data)
# data = remove_stopwords(data)
# data = stem_words(data)
# data = lemmatize_word(data)
return data
processed_text = []
processed_text.append(word_tokenize(str(preprocess(data))))
print(processed_text)
But when I add another function back in (uncomment it), it gives an error, and enabling the functions one by one produces different errors.
These are the errors:
AttributeError: 'numpy.ndarray' object has no attribute 'split'
TypeError: iteration over a 0-d array
TypeError: cannot use a string pattern on a bytes-like object
What could be the reason for the functions working separately but not working together? How can I solve this problem and use these functions to preprocess my data?
Thanks in advance.
Functions that I used:
def convert_lower_case(data):
    return np.char.lower(data)

def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text

def remove_punctuation(data):
    punctuationfree = "".join([i for i in data if i not in string.punctuation])
    return punctuationfree

def stem_words(data):
    stemmer = PorterStemmer()
    word_tokens = word_tokenize(data)
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems

def convert_number(data):
    temp_str = data.split()
    new_string = []
    for word in temp_str:
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)
        else:
            new_string.append(word)
    temp_str = ' '.join(new_string)
    return temp_str

def lemmatize_word(text):
    lemmatizer = WordNetLemmatizer()
    word_tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokens]
    return lemmas
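The imports these helpers rely on are not shown in the question; presumably they are along these lines (the inflect engine p is an assumption, since only the call p.number_to_words appears above):

import string
import numpy as np
import inflect
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

p = inflect.engine()  # assumption: the `p` used in convert_number is an inflect engine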
Comments (1)
The first problem that can be identified is that your convert_lower_case returns something different from what it accepts - which could be perfectly fine, if treated properly. But you keep treating your data as a string, which it no longer is after data = convert_lower_case(data), because np.char.lower gives back a numpy array rather than a str.
"But it looks like a string when I print it" - yes, but it isn't a string. You can see that if you check what the call actually returns:
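For example, a minimal check (assuming convert_lower_case is defined as in the question and numpy is imported as np):

data = "AN example 1 Sentence !!"
data = convert_lower_case(data)
print(data)        # looks like a string when printed
print(type(data))  # but it is not one

Output:

an example 1 sentence !!
<class 'numpy.ndarray'>

np.char.lower wraps its input in a 0-dimensional numpy.ndarray, which is why the later steps break: .split() does not exist on an ndarray, and iterating over a 0-d array (as remove_punctuation does character by character) raises the TypeError from the question.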
Honestly, you are reinventing the wheel here a bit, because Python strings already have a built-in .lower() method that returns an actual str object with the capitals changed to lowercase. Similar issues might occur in the other functions.
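As a rough sketch of that suggestion (not the code from the question; it assumes the required nltk data is downloaded and that the question's p is an inflect engine), a pipeline where every step stays a plain str or a list of tokens could look like this:

import string
import inflect
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

p = inflect.engine()  # assumption: stands in for the question's `p`

def preprocess(data):
    data = data.lower()  # plain str.lower(), the result stays a str
    # convert digits to words while the data is still one string
    data = ' '.join(p.number_to_words(w) if w.isdigit() else w for w in data.split())
    # drop punctuation characters
    data = ''.join(c for c in data if c not in string.punctuation)
    # from here on, work with a list of tokens
    tokens = word_tokenize(data)
    stop_words = set(stopwords.words("english"))
    tokens = [w for w in tokens if w not in stop_words]
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(w) for w in tokens]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w, pos='v') for w in tokens]
    return tokens

processed_text = [preprocess("AN example 1 Sentence !!")]
print(processed_text)

The key point is that each step accepts and returns the type the next step expects, so the functions compose instead of failing when chained.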