UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 139390: invalid continuation byte when scraping from the Twitter API

Posted 2025-02-13 23:19:16

I have a web scraper built with the Python package tweepy that I regularly use to gather tweets for research. Suddenly, it doesn't seem to work anymore. Is the issue that it can no longer decode all the characters?

import csv
import html

# open the file in append mode and create a csv writer
csvFile = open('tweets.csv', 'a')
csvWriter = csv.writer(csvFile)

# loop through the tweets variable and add the contents to the CSV file
for tweet in tweets:
    text = tweet.full_text.strip()
    # convert the text to ASCII, ignoring all non-ASCII characters, e.g. emojis
    text_ascii = text.encode('ascii', 'ignore').decode()
    # split the text on whitespace and newlines into a list of words
    text_list = text_ascii.split()
    # iterate over the words, removing @-mentions and URLs
    text_list_filtered = [word for word in text_list if not (word.startswith('@') or word.startswith('http'))]
    # join the list back into a string
    text_filtered = ' '.join(text_list_filtered)
    # decode HTML-escaped characters
    text_filtered = html.unescape(text_filtered)
    # write the row to the CSV file
    csvWriter.writerow([tweet.created_at, tweet.place, text_filtered])
    print(tweet.created_at, tweet.place, text_filtered)
csvFile.close()
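
For reference, the file above is opened without an explicit encoding= argument, so Python falls back to the platform's default codec rather than UTF-8. A minimal sketch (standard library only) to check what that default is:

import locale

# open() without encoding= uses the locale's preferred encoding,
# which on many Windows setups is cp1252 rather than UTF-8
print(locale.getpreferredencoding(False))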

So when I try to read the file into a pandas DataFrame, I get this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 139390: invalid continuation byte

The line that is giving me the error is this:

tweetsdf = pd.read_csv('tweets.csv')
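
For reference, a minimal sketch to inspect the raw bytes around the position named in the traceback (139390 is taken straight from the error; 0xE1 is 'á' in Latin-1/cp1252, whereas in UTF-8 it would have to start a three-byte sequence, hence the "invalid continuation byte"):

# dump the raw bytes around the failing position reported by pandas
with open('tweets.csv', 'rb') as f:
    raw = f.read()
print(raw[139370:139410])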

I have tried to change the following bit of code from this:

text_ascii = text.encode('ascii','ignore').decode()

to this:

text_ascii = text.encode('utf-8','ignore').decode()

But then I get the same problem when I try to collect the tweets from the API. What should I do?
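
A minimal check of why that change can't help, assuming any ordinary Python str: encoding to UTF-8 and immediately decoding is a round trip that returns the original string unchanged, so nothing different reaches the file:

s = 'Bogotá'  # hypothetical place name with an accented character
assert s.encode('utf-8', 'ignore').decode() == s  # round trip is a no-op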

Comments (1)

或十年 2025-02-20 23:19:16

The 0xE1 byte seems to be caused by the fact that for some tweets, the code writes the location where the tweet was posted. After removing tweet.place from csvWriter.writerow([tweet.created_at, tweet.place, text_filtered]), the error disappeared and I was able to read the CSV file into a pandas DataFrame again.
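
A minimal sketch of the write loop with that fix applied, assuming the same tweets iterable and helpers as in the question (it also passes an explicit encoding to open(), so the file contents no longer depend on the platform's default codec):

import csv
import html

# assumes 'tweets' is the same tweepy result iterable as in the question
with open('tweets.csv', 'a', encoding='utf-8', newline='') as csvFile:
    csvWriter = csv.writer(csvFile)
    for tweet in tweets:
        text = tweet.full_text.strip()
        text_ascii = text.encode('ascii', 'ignore').decode()
        words = text_ascii.split()
        kept = [w for w in words if not (w.startswith('@') or w.startswith('http'))]
        text_filtered = html.unescape(' '.join(kept))
        # tweet.place dropped: its text carried the bytes that broke the UTF-8 read
        csvWriter.writerow([tweet.created_at, text_filtered])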
