Certain characters (trademark sign, etc.) can't be written to a file but can be printed on the screen
I've been scraping data from a website and writing what I find to a file. More than 90% of the time I don't run into Unicode errors, but when the data contains characters such as those in "Burger King®, Hans Café", the write to the file fails, so my error handling prints the text to the screen, where it appears as is and without any further errors.
I've tried the encode and decode functions with various encodings, but to no avail.
Below is an excerpt of my current code:
import urllib2,sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup,NavigableString, SoupStrainer
from string import maketrans
import codecs
f=codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...
soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding
for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
            stream = stream + tag.text.replace(u',',u';')
            if count_detail < 4 :
                stream=stream+","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum)+","+br_name_addr+","+stream.decode(enc_s5)+os.linesep)
    except:
        print "Unicode error ->"+str(storenum)+","+branch_name_address+","+stream
Comments (2)
Your f.write() line doesn't make sense to me. stream will be a unicode, since it's built indirectly from tag.text and BeautifulSoup gives you Unicode, so you shouldn't call decode on stream. (You use decode to turn a str with a particular character encoding into a unicode.) You've opened the file for writing with codecs.open() and told it to use UTF-8, so you can just write() a unicode and that should work. So, instead I would try: ... or, supposing that instead you had just opened the file with f=open('alldetails7.txt','w'), you would do: ...
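The two variants described in this answer might look roughly like the sketch below. It is not the answerer's original code, which did not survive in this thread. The sample values, the helper names (build_row, write_text, write_bytes) and the temporary file paths are all made up for illustration. To keep it runnable on both Python 2 and Python 3, it uses io.open in place of codecs.open and opens the second file in binary mode ('wb') rather than 'w'.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical sample values standing in for the scraped fields.
stnum = 7
br_name_addr = u"Burger King\u00ae"       # "Burger King(R)" with the real sign
stream = u"Hans Caf\u00e9;12 Main St"     # text as BeautifulSoup would return it


def build_row(stnum, br_name_addr, stream):
    """Join the fields as text. No decode() call: the parts are already text."""
    return u"%d,%s,%s\n" % (stnum, br_name_addr, stream)


def write_text(path, row):
    # Variant 1: the file object knows its encoding, so write the text directly.
    with io.open(path, mode="w", encoding="utf-8", newline="") as f:
        f.write(row)


def write_bytes(path, row):
    # Variant 2: a plain byte-oriented file, so encode explicitly before writing.
    with open(path, "wb") as f:
        f.write(row.encode("utf-8"))


row = build_row(stnum, br_name_addr, stream)
tmpdir = tempfile.mkdtemp()
path_text = os.path.join(tmpdir, "variant_text.txt")
path_bytes = os.path.join(tmpdir, "variant_bytes.txt")
write_text(path_text, row)
write_bytes(path_bytes, row)
```

Both files should end up with identical UTF-8 bytes. The point is to encode exactly once, either by letting the file object do it or by calling encode yourself, and never to call decode on something that is already text.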
Have you checked the encoding of the file you're writing to, and made sure the characters can be shown in the encoding you're trying to write to the file? Try setting the character encoding to UTF-8 or something else explicitly to have the characters show up.
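One quick way to check whether a given encoding can represent the text at all is to try encoding it and catch the failure. The snippet below is only an illustration: can_encode is a hypothetical helper, and the sample string is the one from the question.

```python
# -*- coding: utf-8 -*-


def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = u"Burger King\u00ae, Hans Caf\u00e9"

# Map each candidate encoding to whether it can hold the sample text.
results = dict((enc, can_encode(sample, enc)) for enc in ("ascii", "latin-1", "utf-8"))
```

Here ASCII cannot hold the ® and é characters, while Latin-1 and UTF-8 can, which is why an implicit ASCII conversion fails on this kind of data.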