UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 139390: invalid continuation byte, when scraping tweets from the Twitter API
I have a web scraper built with the Python package tweepy that I regularly use to gather tweets for research. Suddenly, it doesn't seem to work anymore. The issue appears to be that it can no longer decode all the characters.
import csv
import html

# open a file to append the data to
csvFile = open('tweets.csv', 'a')
csvWriter = csv.writer(csvFile)

# loop through the tweets variable and add its contents to the CSV file
for tweet in tweets:
    text = tweet.full_text.strip()
    # convert the text to ASCII, ignoring all non-ASCII characters, e.g. emojis
    text_ascii = text.encode('ascii', 'ignore').decode()
    # split the text on whitespace and newlines into a list of words
    text_list = text_ascii.split()
    # iterate over the words, removing @-mentions and URLs
    text_list_filtered = [word for word in text_list
                          if not (word.startswith('@') or word.startswith('http'))]
    # join the list back into a string
    text_filtered = ' '.join(text_list_filtered)
    # decode HTML-escaped characters
    text_filtered = html.unescape(text_filtered)
    # write the text to the CSV file
    csvWriter.writerow([tweet.created_at, tweet.place, text_filtered])
    print(tweet.created_at, tweet.place, text_filtered)
csvFile.close()
So when I try to read the file into a pandas DataFrame, I get this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 139390: invalid continuation byte
The line that is giving me the error is this:
tweetsdf = pd.read_csv('tweets.csv')
I have tried to change the following bit of code from this:
text_ascii = text.encode('ascii','ignore').decode()
to this:
text_ascii = text.encode('utf-8','ignore').decode()
But then I get the same problem when I try to collect the tweets from the API. What should I do?
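A quick way to see why the byte 0xE1 trips up pandas: it is a valid character in single-byte encodings such as latin-1 or cp1252 (where it is 'á'), but it is not a valid UTF-8 sequence on its own. The place name below is a made-up example; any non-ASCII text written by a non-UTF-8 writer would behave the same way:

```python
bad = b'Bogot\xe1'            # 'Bogotá' as written in cp1252/latin-1

try:
    bad.decode('utf-8')       # this is what pd.read_csv does by default
except UnicodeDecodeError as e:
    print('utf-8 fails:', e)

print(bad.decode('latin-1'))  # decodes fine: 'Bogotá'
```

So one workaround on the reading side, assuming the stray bytes really are latin-1/cp1252, is to pass an explicit encoding: `tweetsdf = pd.read_csv('tweets.csv', encoding='latin-1')`.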
The 0xE1 byte seems to be caused by the fact that for some tweets, the code writes the location where the tweet was posted. After removing 'tweet.place' from "csvWriter.writerow([tweet.created_at, tweet.place, text_filtered])", the error disappeared and I was able to read the CSV file into a pandas DataFrame again.
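An alternative fix that keeps tweet.place: open the CSV with an explicit encoding when writing. Without encoding='utf-8', open() uses the platform default (e.g. cp1252 on Windows), where 'á' is written as the single byte 0xE1, exactly the byte the UTF-8 reader later chokes on. A minimal sketch, with stand-in data in place of real tweet objects:

```python
import csv

# stand-in rows; in the real script these come from the tweepy loop
rows = [('2023-01-01', 'Bogotá', 'some filtered tweet text')]

# explicit encoding so non-ASCII place names round-trip as UTF-8
with open('tweets.csv', 'w', encoding='utf-8', newline='') as csvFile:
    csvWriter = csv.writer(csvFile)
    for row in rows:
        csvWriter.writerow(row)

# reading back with the matching encoding succeeds
with open('tweets.csv', encoding='utf-8', newline='') as f:
    print(next(csv.reader(f)))
```

The same applies on the pandas side: pd.read_csv defaults to UTF-8, so writing UTF-8 here means no encoding argument is needed when reading.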