Encoding problem when trying to scrape a page
I'm using BeautifulSoup to scrape a page that has ISO-8859-1 encoding, but I've run into a little hiccup.
I have a line that reads:
logging.info("Processing [%s]" % (link))
The variable link is one of the values scraped with BeautifulSoup. It is a Unicode string, and I can print it by typing print link. It shows up on the console exactly the way it was scraped, but the line above throws this error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 14: ordinal not in range(128)
I've read up on Unicode, but I can't figure out why the string can be printed but not logged.
The string in question is this:
booba-concert-à-bercy
Any ideas on where I'm mucking this up?
Thank you.
Comments (2)
logging doesn't like unicode; pass it bytes.
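The suggestion to pass bytes can be sketched as follows. This is only an illustration, not the commenter's code: to_bytes, build_message and ascii_decode_error are hypothetical helper names, and the idea that the value reaching the log call is UTF-8 encoded is an assumption suggested by the error message (byte 0xc3 at position 14 is where the UTF-8 form of "à" starts), not something the question confirms.

```python
# -*- coding: utf-8 -*-
# Sketch only: to_bytes, build_message and ascii_decode_error are
# hypothetical helper names, not part of logging or BeautifulSoup.

LINK = u"booba-concert-\u00e0-bercy"  # the string from the question; \u00e0 is "à"


def to_bytes(value, encoding="utf-8"):
    """Return value as a byte string, encoding text explicitly."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


def build_message(link):
    """Build the log message from byte strings only, so nothing is converted implicitly."""
    return b"Processing [" + to_bytes(link) + b"]"


def ascii_decode_error(data):
    """Return the error text raised when data is decoded as ASCII, or None if it decodes."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


message = build_message(LINK)
error_text = ascii_decode_error(to_bytes(LINK))
print(repr(message))
print(error_text)
```

On Python 2, the byte string in message could then be handed to logging.info(message). The second helper reproduces the kind of failure in the question: decoding the UTF-8 bytes with the ascii codec fails at the first byte of "à".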
I managed to solve this by adding a file called sitecustomize.py to my Python/Lib/site-packages directory. The file contains two lines: import sys and sys.setdefaultencoding('utf-8'). The default encoding before that was ascii, which caused the errors. Now I don't need to specify an explicit encoding for the link variable, because it is converted using the default encoding, i.e. utf-8. Of course, I won't see the characters displayed properly unless my terminal uses the same encoding, but that won't break my code.
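For reference, the file described in this comment might look roughly like the sketch below. The hasattr guard and the default_encoding variable are additions for illustration and are not part of the two lines the commenter describes. The guard is there because Python 2's site module removes sys.setdefaultencoding once startup files such as sitecustomize.py have run, and Python 3 does not provide the function at all.

```python
# sitecustomize.py: a sketch of the two-line file described above.
import sys

# Assumption: this guard is added for illustration only. On Python 2 the
# function exists while startup files like this one are being imported;
# on Python 3 it does not exist, so the call is skipped.
if hasattr(sys, "setdefaultencoding"):
    sys.setdefaultencoding("utf-8")

# Report the default encoding that implicit str/unicode conversions will use.
default_encoding = sys.getdefaultencoding()
print(default_encoding)
```

Because this changes the default for every implicit conversion in the interpreter, it affects all code that runs under it, not only the logging call from the question.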