How to convert a BeautifulSoup.ResultSet to a string

Posted on 2024-12-10 03:52:59


I parsed an HTML page with BeautifulSoup's .findAll and stored the output in a variable named result.
If I type result in the Python shell and press Enter, I see normal text as expected. However, I want to post-process this result as a string object, and I noticed that str(result) returns garbage, like this sample:

\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0</a><br />\n<hr />\n</div>

The HTML page source is UTF-8 encoded.

How can I handle this?

Code is basically this, in case it matters:

import urllib
from BeautifulSoup import BeautifulSoup

# Python 2 / BeautifulSoup 3: fetch the page and parse it
soup = BeautifulSoup(urllib.urlopen(url).read())
result = soup.findAll(something)

Python is 2.7


浅沫记忆 2024-12-17 03:52:59

Python 2.6.7
BeautifulSoup.version 3.2.0

This worked for me:

u'\n'.join(map(unicode, result))

I'm pretty sure result is a BeautifulSoup.ResultSet object, which seems to be an extension of the standard Python list.
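The join idea in this answer can be sketched without BeautifulSoup installed. The snippet below is only an illustration under stated assumptions: it is written for Python 3 (where str is already Unicode, so str() plays the role of Python 2's unicode()), FakeTag is a hypothetical stand-in for a parsed tag, join_results is a made-up helper name, and a plain list stands in for the ResultSet.

```python
# Python 3 sketch. FakeTag is a hypothetical stand-in for a BeautifulSoup tag,
# and a plain list stands in for the ResultSet (assumptions, not library code).
class FakeTag:
    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        # A real tag would render its own markup here.
        return self.markup


def join_results(result, sep="\n"):
    """Convert every element to text and join the pieces into one string."""
    return sep.join(str(item) for item in result)


result = [FakeTag("<a>чилница</a>"), FakeTag("<hr />")]
joined = join_results(result)
print(joined)
```

Because the elements are converted one by one, the result is a single text string rather than the string form of the whole list.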


简单爱 2024-12-17 03:52:59

import urllib
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib.urlopen(url).read())
# findAll returns multiple parsed results
result = soup.findAll(something)
# iterate over the results
for line in result:
    # get each element as a str; replace 'charset' with 'utf-8' or whichever charset you need
    print line.__str__('charset')

BTW: the BeautifulSoup version is beautifulsoup-3.2.1


岁月蹉跎了容颜 2024-12-17 03:52:59

This is not garbage, it is UTF-8 encoded text. Use Unicode instead.
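To make this point concrete, here is a small sketch that decodes the byte sample from the question. It assumes Python 3 syntax (bytes literals and a Unicode str); in Python 2.7 the same .decode("utf-8") call turns a byte str into a unicode object. The variable names raw and text are just illustrative.

```python
# Python 3 sketch: the escaped bytes from the question are ordinary UTF-8.
raw = b"\xd1\x87\xd0\xb8\xd0\xbb\xd0\xbd\xd0\xb8\xd1\x86\xd0\xb0</a><br />\n<hr />\n</div>"

# bytes -> Unicode text
text = raw.decode("utf-8")
print(text)

# Encoding the text again reproduces the original bytes exactly.
round_trip = text.encode("utf-8")
print(round_trip == raw)
```

The \xd1\x87... escapes are how a byte string is displayed, not corrupted data; once decoded, the text is readable Cyrillic followed by the HTML tags.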


岁月流歌 2024-12-17 03:52:59

Use this:

import unicodedata
unicodedata.normalize('NFKC', p.decode('utf-8')).encode('ascii', 'ignore')

Unicode has multiple normalization forms.
That output should not be garbage.
Use the originalEncoding attribute to verify the encoding scheme.
Regarding Python's Unicode implementation, refer to this document (it covers normalization as well).
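It may help to see what the normalize-then-encode call in this answer actually does to different inputs. The sketch below assumes Python 3 and an input that is already decoded text; to_ascii is a hypothetical helper name, not part of unicodedata.

```python
import unicodedata


def to_ascii(text):
    """NFKC-normalize text, then drop every character that is not ASCII."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.encode("ascii", "ignore")


# U+FB01 is the single-character "fi" ligature; NFKC expands it to "f" + "i".
ligature = to_ascii("\ufb01le")
# Cyrillic letters have no ASCII equivalent, so 'ignore' silently drops them.
cyrillic = to_ascii("чилница</a>")

print(ligature)   # b'file'
print(cyrillic)   # b'</a>'
```

For text like the sample in the question, this approach discards the Cyrillic characters entirely, so it fits only when an ASCII-only result is acceptable.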
