How to convert a BeautifulSoup.ResultSet to a string
So I parsed an HTML page with .findAll (BeautifulSoup) into a variable named result.
If I type result in the Python shell and press Enter, I see normal text as expected, but since I wanted to postprocess this result as a string object, I noticed that str(result) returns garbage, like this sample:
\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0</a><br />\n<hr />\n</div>
The HTML page source is UTF-8 encoded.
How can I handle this?
The code is basically this, in case it matters:
import urllib
from BeautifulSoup import BeautifulSoup
# url and something are placeholders
soup = BeautifulSoup(urllib.urlopen(url).read())
result = soup.findAll(something)
Python is 2.7
Comments (4)
Python 2.6.7
BeautifulSoup.version 3.2.0
This worked for me:
I'm pretty sure result is a BeautifulSoup.ResultSet object, which seems to be an extension of the standard Python list.
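Since a ResultSet behaves like a list of tags, one plausible way to get a single string is to convert each element to text and join the pieces. The sketch below is only an illustration of that idea and is not necessarily the code this answer used: the helper name result_set_to_text and the sample markup are hypothetical, and a plain list of Unicode strings stands in for the real ResultSet so that it runs without BeautifulSoup installed.

```python
# Hypothetical helper: relies only on the result being iterable,
# which holds for a list-like ResultSet.
def result_set_to_text(result, sep=u"\n"):
    """Join every element of a list-like result into one Unicode string."""
    text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3
    return sep.join(text_type(item) for item in result)

# Stand-in for soup.findAll(...); real code would pass the ResultSet itself.
fake_result = [
    u"<a href=\"/x\">\u0447\u0438\u043b\u043d\u0438\u0446\u0430</a>",
    u"<hr />",
]
joined = result_set_to_text(fake_result)
print(repr(joined))
```

With BeautifulSoup 3 on Python 2 the call would presumably be result_set_to_text(soup.findAll(something)); if a byte string is needed afterwards, joined.encode("utf-8") produces one.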
BTW: BeautifulSoup's version is beautifulsoup-3.2.1
That's not garbage, that's UTF-8-encoded text. Use Unicode instead.
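To make this concrete, here is a small sketch that takes the escaped bytes from the sample in the question and decodes them as UTF-8. The variable names raw and text are chosen for illustration only; the snippet uses no BeautifulSoup calls and should run unchanged on Python 2.7 and Python 3.

```python
# Bytes copied from the sample output in the question.
raw = b"\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0"

# Decoding interprets the UTF-8 byte pairs as Unicode code points.
text = raw.decode("utf-8")

print(len(raw))    # 14 (bytes)
print(len(text))   # 7 (Cyrillic characters)
print(repr(text))
```

Each Cyrillic letter occupies two bytes in UTF-8, which is why the byte string looks like a run of \x escapes while the decoded value is seven readable characters.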
Use this:
Unicode has multiple normalization forms
That output should not be garbage.
Use the originalEncoding attribute to verify the encoding scheme.
Regarding Python's Unicode implementation, refer to this document (even for the normalization).
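As a rough illustration of the normalization remark only (it does not reproduce the code this answer referred to), the sketch below uses the standard-library unicodedata module to show that the same visible character can be stored as different code point sequences. The sample character and variable names are assumptions made for the example.

```python
import unicodedata

# CYRILLIC SMALL LETTER SHORT I stored as a single code point.
composed = u"\u0439"

# NFD splits it into the base letter plus a combining breve.
decomposed = unicodedata.normalize("NFD", composed)

# The two strings render the same but do not compare equal as-is.
same_before = composed == decomposed

# Normalizing both sides to one form (NFC here) makes them comparable.
same_after = (unicodedata.normalize("NFC", composed)
              == unicodedata.normalize("NFC", decomposed))

print(repr(decomposed))
print(same_before)
print(same_after)
```

Normalizing scraped text to a single form before comparing or searching avoids mismatches between strings that only look identical.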