Certain characters (trademark sign, etc.) can't be written to a file but can be printed on the screen

Posted on 2024-10-19 17:09:55

I've been trying to scrape data from a website and write what I find to a file. More than 90% of the time I don't run into Unicode errors, but when the data contains characters such as those in "Burger King®, Hans Café", the write to the file fails, so my error handling prints the line to the screen as is, with no further errors.

I've tried the encode and decode functions with various encodings, but to no avail.

Please find an excerpt of the current code that I've written below:

import urllib2,sys
import re
import os
import urllib
import string
import time
from BeautifulSoup import BeautifulSoup,NavigableString, SoupStrainer
from string import maketrans
import codecs

f=codecs.open('alldetails7.txt', mode='w', encoding='utf-8', errors='replace')
...


soup5 = BeautifulSoup(html5)
enc_s5 = soup5.originalEncoding

for company in iter(soup5.findAll(height="20px")):
    stream = ""
    count_detail = 1
    for tag in iter(company.findAll('td')):
        if count_detail > 1:
           stream = stream + tag.text.replace(u',',u';')
           if count_detail < 4 :
              stream=stream+","
        count_detail = count_detail + 1
    stream.strip()
    try:
        f.write(str(stnum)+","+br_name_addr+","+stream.decode(enc_s5)+os.linesep)
    except:
        print "Unicode error ->"+str(storenum)+","+branch_name_address+","+stream

Comments (2)

宛菡 2024-10-26 17:09:55

Your f.write() line doesn't make sense to me: stream will be a unicode, since it's built indirectly from tag.text and BeautifulSoup gives you Unicode, so you shouldn't call decode on stream. (You use decode to turn a str in a particular character encoding into a unicode. In Python 2, calling decode on something that is already a unicode first encodes it implicitly as ASCII, which fails on characters like ® and é.) You've opened the file for writing with codecs.open() and told it to use UTF-8, so you can just write() a unicode and that should work. So instead I would try:

f.write(unicode(stnum)+u","+br_name_addr+u","+stream+os.linesep)

... or, supposing that instead you had just opened the file with f=open('alldetails7.txt','w'), you would do:

line = unicode(stnum)+u","+br_name_addr+u","+stream+os.linesep
f.write(line.encode('utf-8'))
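
To make the idea concrete, here is a minimal sketch rather than the asker's actual script. It uses io.open in place of codecs.open so that it runs on both Python 2.7 and Python 3, and everything named here (write_row, alldetails_demo.txt, the sample field values) is hypothetical. It keeps all fields as text, lets the file object do the UTF-8 encoding, and shows that an ASCII encode of the same text fails, which is the step Python 2 performs implicitly when decode is called on a unicode.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_row(path, store_number, name_address, details):
    """Join the fields as text and let the file object encode to UTF-8."""
    # All pieces are text (unicode); no encode/decode calls are needed here.
    line = u",".join([u"%d" % store_number, name_address, details]) + u"\n"
    with io.open(path, mode="w", encoding="utf-8") as f:
        f.write(line)
    return line


# Hypothetical sample data containing the characters from the question.
path = os.path.join(tempfile.mkdtemp(), "alldetails_demo.txt")
written = write_row(path, 7, u"Burger King\u00ae", u"Hans Caf\u00e9;Main St")

# Read the file back as text and as raw bytes.
with io.open(path, encoding="utf-8") as f:
    read_back = f.read()
with io.open(path, "rb") as f:
    raw = f.read()

# The same text cannot be encoded as ASCII; Python 2 attempts exactly this
# implicitly when .decode() is called on an object that is already unicode.
try:
    u"Burger King\u00ae".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

After running this, read_back equals written, raw contains the UTF-8 byte sequences for ® and é, and ascii_ok is False.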

长亭外,古道边 2024-10-26 17:09:55

Have you checked the encoding of the file you're writing to, and made sure the characters can be represented in that encoding? Try explicitly setting the character encoding to UTF-8, or to another encoding that covers these characters, so that they show up.
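
One way to run that check is to try encoding a sample string in each candidate encoding. The sketch below does this; can_encode is a hypothetical helper, and the sample strings are made up from the characters mentioned in the question.

```python
# -*- coding: utf-8 -*-


def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = u"Burger King\u00ae, Hans Caf\u00e9"  # registered sign, e-acute
results = {enc: can_encode(sample, enc) for enc in ("ascii", "latin-1", "utf-8")}

# The trademark sign (U+2122) is outside Latin-1 but inside Windows-1252.
trademark = u"\u2122"
tm_latin1 = can_encode(trademark, "latin-1")
tm_cp1252 = can_encode(trademark, "cp1252")
```

For this sample, ASCII is the only one of the three encodings that fails, while the trademark sign also fails in Latin-1. UTF-8 can represent all of these characters, which is why it is the safer explicit choice.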
