BeautifulSoup is not giving me Unicode

Posted 2024-09-08 14:36:35, 782 characters, 1 view, 0 comments


I'm using Beautiful Soup to scrape data. The BS documentation states that BS should always return Unicode, but I can't seem to get Unicode. Here's a code snippet:

import urllib2
from libs.BeautifulSoup import BeautifulSoup

# Fetch and parse the data
url = 'http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007?skin=print.pattern'

data = urllib2.urlopen(url).read()
# Use the % operator to fill in %s; a comma would print the format string literally
print 'Type of fetched HTML : %s' % type(data)

soup = BeautifulSoup(data)
print 'Encoding of souped up HTML : %s' % soup.originalEncoding

table = soup.table
print type(table.renderContents())

The original data returned from the page is a string, and BS reports the original encoding as ISO-8859-1. I thought that BS automatically converted everything to Unicode, so why is it that when I do this:

table = soup.table
print type(table.renderContents())

...it gives me a string object and not Unicode?

How can I get Unicode objects from BS?

I'm really, really lost with this. Any help? Thanks in advance.


菊凝晚露 2024-09-15 14:36:36


originalEncoding is exactly that - the source encoding, so the fact that BS is storing everything as unicode internally won't change that value. When you walk the tree, all text nodes are unicode, all tags are in unicode, etc., unless you otherwise convert them (say by using print, str, prettify, or renderContents).

Try doing something like:

soup = BeautifulSoup(data)
print type(soup.contents[0])

Unfortunately, everything you've done up to this point has happened to use the very few methods in BS that convert to strings.
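To make the distinction concrete, here is a minimal sketch in Python 3 using only the standard library, not Beautiful Soup itself. In Python 3, bytes plays the role of Python 2's str, and str plays the role of unicode. The sample markup and the render helper are assumptions made up for illustration; render only mimics the idea of an output method that encodes on the way out, while the source encoding stays a separate piece of information.

```python
# Illustrative sketch only: stdlib Python 3, not the Beautiful Soup API.
# Python 3 `bytes` ~ Python 2 `str`; Python 3 `str` ~ Python 2 `unicode`.

# Made-up sample of what a server might send, encoded as ISO-8859-1.
raw = "<table><tr><td>caf\u00e9</td></tr></table>".encode("iso-8859-1")

# Parsing step: decode the source bytes once, using the source encoding.
source_encoding = "iso-8859-1"
text = raw.decode(source_encoding)  # held internally as Unicode text


def render(unicode_text, encoding="utf-8"):
    """Hypothetical stand-in for an output method that encodes its result."""
    return unicode_text.encode(encoding)


rendered = render(text)

print(type(raw).__name__)       # type of the fetched data
print(type(text).__name__)      # type of the internal representation
print(type(rendered).__name__)  # type of the rendered output
print(source_encoding)          # the source encoding is not affected by decoding
```

The point of the sketch is that the rendered output being an encoded byte string says nothing about how the text is stored in between.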


温柔女人霸气范 2024-09-15 14:36:35

As you may have noticed, renderContents returns (by default) a string encoded in UTF-8, but if you really want a Unicode string representing the entire document, you can also do unicode(soup) or use unicode(soup.prettify(), "utf-8").
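The conversion in the other direction can be sketched the same way. This is stdlib Python 3, where str(some_bytes, "utf-8") is the counterpart of Python 2's unicode(some_bytes, "utf-8"); the byte string below is a made-up stand-in for UTF-8 output, not real Beautiful Soup output.

```python
# Illustrative sketch only: stdlib Python 3, not the Beautiful Soup API.
# Made-up stand-in for a UTF-8 encoded rendering of a table cell.
rendered = "<td>caf\u00e9</td>".encode("utf-8")

# Python 3 counterpart of Python 2's unicode(rendered, "utf-8")
as_text = str(rendered, "utf-8")
also_text = rendered.decode("utf-8")  # equivalent spelling

# Decoding with the wrong codec does not raise for ISO-8859-1;
# it silently produces different (garbled) text instead.
garbled = rendered.decode("iso-8859-1")

print(as_text == also_text)
print(as_text == garbled)
```

Decoding with the codec that was actually used for output matters, because a mismatch can pass silently.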

About the author: 安静
