UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 139390: invalid continuation byte when scraping from the Twitter API

Posted 2025-02-13 23:19:16

I have a web scraper built with the Python package tweepy that I regularly use to gather tweets for research. Suddenly, it doesn't seem to work anymore. Is the issue that it can no longer decode all the characters?

import csv
import html

# open the file in append mode and create a csv writer
csvFile = open('tweets.csv', 'a')
csvWriter = csv.writer(csvFile)

# loop through the tweets variable and add the contents to the CSV file
for tweet in tweets:
    text = tweet.full_text.strip()
    # convert the text to ASCII, ignoring all non-ASCII characters, e.g. emojis
    text_ascii = text.encode('ascii', 'ignore').decode()
    # split the text on whitespace and newlines into a list of words
    text_list = text_ascii.split()
    # iterate over the words, removing @-mentions and URLs
    text_list_filtered = [word for word in text_list if not (word.startswith('@') or word.startswith('http'))]
    # join the list back into a string
    text_filtered = ' '.join(text_list_filtered)
    # decode HTML-escaped characters
    text_filtered = html.unescape(text_filtered)
    # write the row to the CSV file
    csvWriter.writerow([tweet.created_at, tweet.place, text_filtered])
    print(tweet.created_at, tweet.place, text_filtered)
csvFile.close()
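
For reference, the file above is opened without an explicit encoding= argument, so Python falls back to the platform's default codec rather than UTF-8. A minimal sketch (standard library only) to check what that default is:

import locale

# open() without encoding= uses the locale's preferred encoding,
# which on many Windows setups is cp1252 rather than UTF-8
print(locale.getpreferredencoding(False))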

So when I try to read the file into a pandas DataFrame, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 139390: invalid continuation byte

The line that is giving me the error is this:

tweetsdf = pd.read_csv('tweets.csv')
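
For reference, a minimal sketch to inspect the raw bytes around the position named in the traceback (139390 is taken straight from the error; 0xE1 is 'á' in Latin-1/cp1252, whereas in UTF-8 it would have to start a three-byte sequence, hence the "invalid continuation byte"):

# dump the raw bytes around the failing position reported by pandas
with open('tweets.csv', 'rb') as f:
    raw = f.read()
print(raw[139370:139410])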

I have tried to change the following bit of code from this:

text_ascii = text.encode('ascii','ignore').decode()

to this:

text_ascii = text.encode('utf-8','ignore').decode()

But then I get the same problem when I try to collect the tweets from the API. What should I do?
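
A minimal check of why that change can't help, assuming any ordinary Python str: encoding to UTF-8 and immediately decoding is a round trip that returns the original string unchanged, so nothing different reaches the file:

s = 'Bogotá'  # hypothetical place name with an accented character
assert s.encode('utf-8', 'ignore').decode() == s  # round trip is a no-op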

Comments (1)

或十年 2025-02-20 23:19:16

The 0xE1 byte seems to be caused by the fact that for some tweets, the code writes the location where the tweet was posted. After removing tweet.place from csvWriter.writerow([tweet.created_at, tweet.place, text_filtered]), the error disappeared and I was able to read the CSV file into a pandas DataFrame again.
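
A minimal sketch of the write loop with that fix applied, assuming the same tweets iterable and helpers as in the question (it also passes an explicit encoding to open(), so the file contents no longer depend on the platform's default codec):

import csv
import html

# assumes 'tweets' is the same tweepy result iterable as in the question
with open('tweets.csv', 'a', encoding='utf-8', newline='') as csvFile:
    csvWriter = csv.writer(csvFile)
    for tweet in tweets:
        text = tweet.full_text.strip()
        text_ascii = text.encode('ascii', 'ignore').decode()
        words = text_ascii.split()
        kept = [w for w in words if not (w.startswith('@') or w.startswith('http'))]
        text_filtered = html.unescape(' '.join(kept))
        # tweet.place dropped: its text carried the bytes that broke the UTF-8 read
        csvWriter.writerow([tweet.created_at, text_filtered])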
