How to open an HTML page with windows-1252 encoding in BeautifulSoup

Posted on 2024-11-07 03:23:54

I am trying to parse an HTML document with BeautifulSoup, but I am running into trouble. What is the best way to open an HTML document that uses windows-1252 encoding?

I tried converting the file to UTF-8 with iconv, but that did not work either.

doc = open("e.html").read()

soup = BeautifulSoup(doc)

soup.findAll('p')

UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 103: ordinal not in range(128)

When I open the file without running iconv first, I get the same error.

Full traceback:

>>> soup.findAll('p')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 103: ordinal not in range(128)

半边脸i 2024-11-14 03:23:54

I was getting a similar error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 723617: invalid continuation byte

What worked for me was to specify the input encoding like so:

page = open("page.html", encoding="windows-1252")

soup = BeautifulSoup(page.read(), "html.parser")
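
For completeness, here is a minimal, self-contained sketch of the same idea. It assumes Python 3; the helper name read_html and the sample markup are made up for illustration and are not from the original post. The file is opened in a with block so it is closed afterwards, and the BeautifulSoup step is skipped if bs4 is not installed.

```python
import os
import tempfile


def read_html(path, encoding="windows-1252"):
    """Return the file's contents as str, decoded with the given encoding.

    Hypothetical helper for illustration only.
    """
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Build a small windows-1252 sample file so the sketch is self-contained.
sample = "<html><body><p>Grüße aus Zürich</p></body></html>"
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write(sample.encode("windows-1252"))
    sample_path = tmp.name

html_text = read_html(sample_path)
os.remove(sample_path)

try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:
    paragraphs = None  # bs4 is missing; html_text is still decoded correctly
else:
    soup = BeautifulSoup(html_text, "html.parser")
    paragraphs = [p.get_text() for p in soup.find_all("p")]
```

Because the text is already a str by the time it reaches BeautifulSoup, the parser does not have to guess the encoding.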

始终不够 2024-11-14 03:23:54

Try something like this:

doc = open("e.html").read()

doc = doc.decode('cp1252')

soup = BeautifulSoup(doc)

soup.findAll('p')
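
The snippet above relies on Python 2, where read() returns a byte string that has a decode method. On Python 3 a rough equivalent, sketched here under the assumption that the file is read in binary mode, is to decode the bytes explicitly or to hand the raw bytes to BeautifulSoup together with its from_encoding argument. The sample bytes are invented for illustration.

```python
# Bytes as they would come from open("e.html", "rb").read().
raw = b"<p>M\xfcnchen</p>"

# cp1252 is Python's codec name for windows-1252; byte 0xfc decodes to "ü".
text = raw.decode("cp1252")

# The same bytes are not valid UTF-8, which is why a UTF-8 read fails.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

try:
    from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4
except ImportError:
    first_paragraph = None  # bs4 is missing; the decoding above still applies
else:
    # from_encoding tells BeautifulSoup how to decode the raw bytes.
    soup = BeautifulSoup(raw, "html.parser", from_encoding="windows-1252")
    first_paragraph = soup.p.get_text()
```

Note that find_all is the current spelling in bs4; findAll is the older name used in the snippets above.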

Question author: 缱倦旧时光
