Some characters (trademark sign, etc.) can't be written to a file but print fine on screen

Posted 2024-10-19 17:09:55

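I've been trying to scrape data from a website and write what I find to a file. More than 90% of the time I don't run into Unicode errors, but when the data contains characters such as those in "Burger King®, Hans Café", the write to the file fails, so my error handler prints the line to the screen as is, with no further errors.

I've tried the encode and decode functions with various encodings, but to no avail.

Below is an excerpt of my current code: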

import urllib2,sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup,NavigableString, SoupStrainer
from string import maketrans
import codecs

f=codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...


soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding

for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
           stream = stream + tag.text.replace(u',',u';')
           if count_detail < 4 :
              stream=stream+","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum)+","+br_name_addr+","+stream.decode(enc_s5)+os.linesep)
    except:
        print "Unicode error ->"+str(storenum)+","+branch_name_address+","+stream

宛菡 2024-10-26 17:09:55

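Your f.write() line doesn't make sense to me: stream will be a unicode, since it's built indirectly from tag.text and BeautifulSoup gives you Unicode, so you shouldn't call decode on stream. (You use decode to turn a str in a particular character encoding into a unicode. In Python 2, calling decode on a unicode object first encodes it implicitly with the ASCII codec, which is what fails on characters such as ® and é.) You've opened the file for writing with codecs.open() and told it to use UTF-8, so you can just write() a unicode and that should work. So, instead I would try: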

f.write(unicode(stnum)+u","+br_name_addr+u","+stream+os.linesep)

... or, supposing that instead you had just opened the file with f=open('alldetails7.txt','w'), you would do:

line = unicode(stnum)+u","+br_name_addr+u","+stream+os.linesep
f.write(line.encode('utf-8'))
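
For readers on Python 3, where str is already Unicode text, the same idea can be sketched as follows. This is only an illustration, not the asker's code: the helper names format_row, write_rows and demo are hypothetical, and the sample row reuses the two strings from the question. The point it shows is that the fields stay text all the way through and the file object does the single encoding step.

```python
import io
import os
import tempfile


def format_row(store_number, name_and_address, details):
    """Join the fields into one comma-separated line of text (never bytes)."""
    return ",".join([str(store_number), name_and_address, details]) + "\n"


def write_rows(path, rows):
    """Write rows as UTF-8; the file object does the encoding on write."""
    with io.open(path, "w", encoding="utf-8") as out:
        for row in rows:
            out.write(format_row(*row))


def demo():
    """Write one sample row to a temporary file and return the raw bytes."""
    path = os.path.join(tempfile.mkdtemp(), "sample.txt")
    # Sample values taken from the question; the store number 7 is made up.
    write_rows(path, [(7, "Burger King\u00ae", "Hans Caf\u00e9")])
    with io.open(path, "rb") as raw:
        return raw.read()
```

Note that a Python 3 text-mode file translates "\n" to the platform line ending on write, so os.linesep is not needed in this variant.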

长亭外，古道边 2024-10-26 17:09:55

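Have you checked the encoding of the file you are writing to, and made sure the characters can be represented in the encoding you are trying to write? Try setting the character encoding to UTF-8, or another explicit encoding that can represent those characters.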
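
One way to act on this suggestion is to check, before writing, which characters a given target encoding cannot represent. The following is a minimal Python 3 sketch; the helper name unencodable_chars is hypothetical and the sample string is the one from the question.

```python
def unencodable_chars(text, encoding):
    """Return the characters of `text` that `encoding` cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


sample = "Burger King\u00ae, Hans Caf\u00e9"

# ASCII has no code for the registered sign or the accented e.
ascii_problems = unencodable_chars(sample, "ascii")   # ['®', 'é']

# UTF-8 can represent every character in the sample.
utf8_problems = unencodable_chars(sample, "utf-8")    # []
```

An empty result means the whole string can be written in that encoding without errors or replacement characters.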

About the author

秋千易
