How to solve TypeError: iteration over a 0-d array and TypeError: Cannot use a string pattern on a bytes-like object
I am trying to apply preprocessing steps to my data. I have 6 functions to preprocess the data, and I call them from a preprocess function. It works when I try these functions one by one on the example sentence.
data = "AN example 1 Sentence !!"
def preprocess(data):
data = convert_lower_case(data)
# data = convert_number(data)
# data = remove_punctuation(data)
# data = remove_stopwords(data)
# data = stem_words(data)
# data = lemmatize_word(data)
return data
processed_text = []
processed_text.append(word_tokenize(str(preprocess(data))))
print(processed_text)
But when I add another function back in (uncomment it), it gives an error, and enabling the functions one by one produces different errors.
These are the errors:
AttributeError: 'numpy.ndarray' object has no attribute 'split'
TypeError: iteration over a 0-d array
TypeError: cannot use a string pattern on a bytes-like object
What could be the reason for the functions working separately but not working together? How can I solve this problem and use these functions to preprocess my data?
Thanks in advance.
Functions that I used:
def convert_lower_case(data):
    return np.char.lower(data)

def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word not in stop_words]
    return filtered_text

def remove_punctuation(data):
    punctuationfree = "".join([i for i in data if i not in string.punctuation])
    return punctuationfree

def stem_words(data):
    stemmer = PorterStemmer()
    word_tokens = word_tokenize(data)
    stems = [stemmer.stem(word) for word in word_tokens]
    return stems

def convert_number(data):
    temp_str = data.split()
    new_string = []
    for word in temp_str:
        if word.isdigit():
            temp = p.number_to_words(word)
            new_string.append(temp)
        else:
            new_string.append(word)
    temp_str = ' '.join(new_string)
    return temp_str

def lemmatize_word(text):
    lemmatizer = WordNetLemmatizer()
    word_tokens = word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(word, pos='v') for word in word_tokens]
    return lemmas
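The imports these helpers rely on are not shown in the question; presumably they are along these lines (the inflect engine p is an assumption, since only the call p.number_to_words appears above):

import string
import numpy as np
import inflect
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

p = inflect.engine()  # assumption: the `p` used in convert_number is an inflect engine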
Comments (1)
The first problem that can be identified is that your convert_lower_case returns something different from what it accepts - which could be perfectly fine, if treated properly. But you keep treating your data as a string, which it no longer is after data = convert_lower_case(data), because np.char.lower gives back a numpy array rather than a str.
"But it looks like a string when I print it" - yes, but it isn't a string. You can see that if you check what the call actually returns:
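For example, a minimal check (assuming convert_lower_case is defined as in the question and numpy is imported as np):

data = "AN example 1 Sentence !!"
data = convert_lower_case(data)
print(data)        # looks like a string when printed
print(type(data))  # but it is not one

Output:

an example 1 sentence !!
<class 'numpy.ndarray'>

np.char.lower wraps its input in a 0-dimensional numpy.ndarray, which is why the later steps break: .split() does not exist on an ndarray, and iterating over a 0-d array (as remove_punctuation does character by character) raises the TypeError from the question.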
Honestly, you are reinventing the wheel here a bit, because Python strings already have a built-in .lower() method that returns an actual str object with the capitals changed to lowercase. Similar issues might occur in the other functions.
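As a rough sketch of that suggestion (not the code from the question; it assumes the required nltk data is downloaded and that the question's p is an inflect engine), a pipeline where every step stays a plain str or a list of tokens could look like this:

import string
import inflect
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

p = inflect.engine()  # assumption: stands in for the question's `p`

def preprocess(data):
    data = data.lower()  # plain str.lower(), the result stays a str
    # convert digits to words while the data is still one string
    data = ' '.join(p.number_to_words(w) if w.isdigit() else w for w in data.split())
    # drop punctuation characters
    data = ''.join(c for c in data if c not in string.punctuation)
    # from here on, work with a list of tokens
    tokens = word_tokenize(data)
    stop_words = set(stopwords.words("english"))
    tokens = [w for w in tokens if w not in stop_words]
    stemmer = PorterStemmer()
    tokens = [stemmer.stem(w) for w in tokens]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w, pos='v') for w in tokens]
    return tokens

processed_text = [preprocess("AN example 1 Sentence !!")]
print(processed_text)

The key point is that each step accepts and returns the type the next step expects, so the functions compose instead of failing when chained.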