How do I parse HTML containing non-ASCII characters with BeautifulSoup?

Posted on 2024-11-24 07:04:59

I keep getting the following error when trying to parse some html using BeautifulSoup:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

I've tried decoding the html using the solution to the questions below, but keep getting the same error. I've tried all the solutions to the questions below but none of them work (posting so that I don't get duplicate answers and in case they help anyone to find a solution by viewing related approaches to the problem).

Anybody know where I'm going wrong here? Is this a bug in BeautifulSoup and should I install an earlier version?

EDIT: code and traceback below:

from BeautifulSoup import BeautifulSoup as bs
soup = bs(html)

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

EDIT: error message per comment below:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 1282, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 946, in __init__
    self._feed()
  File "/var/lib/python-support/python2.5/BeautifulSoup.py", line 971, in _feed
    SGMLParser.feed(self, markup)
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xae in position 0: ordinal not in range(128)

Thanks for your help!

'ascii' codec error in beautifulsoup

UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

How do I convert a file's format from Unicode to ASCII using Python?

python UnicodeEncodeError > How can I simply remove troubling unicode characters?

UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

Comments (2)

季末如歌 2024-12-01 07:04:59

You say in a comment: """I just looked up the content-type of the html I'm trying to parse to see if it was something I hadn't tried (earlier I just assumed it was UTF-8) but sure enough it was UTF-8 so another dead end."""

Sigh. This is exactly why I have been trying to get you to divulge the HTML that you are trying to parse. The error message indicates that the (first) problem byte is \xae which is definitely NOT a valid lead byte in a UTF-8 sequence.

Either divulge the link to your HTML, or do some basic debugging:

Does uc = html.decode('utf8') work or fail? If fail, with what error message?
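
A minimal sketch of that debugging step, written for Python 3 (where raw HTML is `bytes`) rather than the Python 2.5 used in the question. The helper name `try_decode` and the two sample byte strings are made up for illustration; substitute the bytes you actually fetched.

```python
def try_decode(raw, encoding="utf-8"):
    """Return (text, None) on success, or (None, description) on failure."""
    try:
        return raw.decode(encoding), None
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded
        bad = raw[exc.start:exc.start + 1]
        context = raw[max(0, exc.start - 20):exc.start + 20]
        return None, "byte %r at offset %d (%s); context: %r" % (
            bad, exc.start, exc.reason, context)


# Hypothetical samples: the registered sign encoded as UTF-8, and a bare 0xAE byte
good = '<a title="Brand\u00ae">x</a>'.encode("utf-8")
bad = b'<a title="Brand\xae">x</a>'

print(try_decode(good))
print(try_decode(bad))
```

If the decode succeeds, the input really is valid UTF-8 and the failure is happening later, inside the parser. If it fails, the reported offset and context show which part of the page to look at.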

You also said: """I'm starting to think this is a bug in BS, which they allude to in the docs, and can be seen here: crummy.com/software/BeautifulSoup/CHANGELOG.html."""

I can't imagine which of the vague entries in the changelog you are referring to. Consider debugging your problem before you rush to update.

Update: Looks like an obscure bug in sgmllib.py. In line 394, change 255 to 127 and it appears to work. Corner case: HTML char ref (such as &#174;, rendered here as ®) in an attribute value AND with 128 <= ordinal < 255.
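
To see why that corner case produces exactly this message, here is a small sketch in Python 3 (it is not the Python 2.5 sgmllib code itself). It assumes the reference in the attribute was numeric, such as `&#174;`: ordinal 174 is 0xae, a single byte outside ASCII, and Python 2 fell back to the ascii codec whenever such a byte string was combined with a unicode string.

```python
import html

ref = "&#174;"                     # assumed example of a numeric char ref
codepoint = int(ref[2:-1])         # strip "&#" and ";" to get 174
print(hex(codepoint), html.unescape(ref))

# In Python 2, chr(174) was the one-byte string '\xae'; decoding that byte
# as ASCII is what fails.
raw = bytes([codepoint])
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    message = str(exc)
    print(message)
```

The byte and position in the printed message match the traceback in the question, which is consistent with the limit of 255 letting a non-ASCII byte through.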

Further comments Rather than hack your copy of sgmllib.py, grab a copy of the latest sgmllib.py from the 2.7 branch -- BS 3.0.4 ran OK for me on Python 2.7.1. Even better, upgrade your Python to 2.7.

比忠 2024-12-01 07:04:59

I tried to use pyquery on the html and the result is fine.

import urllib
from pyquery import PyQuery

html = urllib.urlopen('http://www.6pm.com/onitsuka-tiger-by-asics-ultimate-81').read()
pq = PyQuery(html)
print pq('span#price').text() # "$39.00 40% off MSRP $65.00"

pyquery is based on lxml so it's also much faster than beautifulsoup.
