Cleaning up and removing tags with BeautifulSoup

Posted on 2024-09-07 17:50:43

I have the following script so far:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2

br = Browser()
br.open("http://www.foo.com")

html = br.response().read()

soup = BeautifulSoup(html)
items = soup.findAll(id="info")

It runs perfectly and produces the following "items":

<div id="info">
<span class="customer"><b>John Doe</b></span><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>

However, I'd like to take items and clean them up to get:

John Doe
123 Main Street
5551234

How can you remove such tags in BeautifulSoup and Python?

As always, thanks!

樱娆 2024-09-14 17:50:43

This will do it for this EXACT HTML. Obviously it isn't tolerant of any deviation, so you'll want to add quite a lot of bounds checking and null checking, but here are the nuts and bolts of getting your data into plain text.

items = soup.findAll(id="info")
print items[0].span.b.contents[0]
print items[0].contents[3].strip()
print items[0].contents[5].strip().split(":", 1)[1]
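
If hard-coded positions in contents feel too brittle, here is a rough sketch of a more tolerant variant. It assumes the newer bs4 package (installed as beautifulsoup4) on Python 3 rather than the BeautifulSoup 3 import used in the question. The function name extract_info is made up for illustration, and treating "Phone:" as a label to strip and the "paid" span as something to drop are assumptions based only on the sample HTML above.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """<div id="info">
<span class="customer"><b>John Doe</b></span><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>"""


def extract_info(html):
    """Return the text lines inside the id="info" element, without tags.

    Hypothetical helper: the "paid" flag is dropped and a leading
    "Phone:" label is stripped, as in the desired output in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find(id="info")
    if info is None:
        # No matching element: return an empty result instead of raising.
        return []

    # Remove the paid/unpaid flag so its text is not picked up.
    paid = info.find("span", class_="paid")
    if paid is not None:
        paid.decompose()

    lines = []
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips nodes that are whitespace only.
    for text in info.stripped_strings:
        if text.startswith("Phone:"):
            text = text.split(":", 1)[1].strip()
        lines.append(text)
    return lines


if __name__ == "__main__":
    print("\n".join(extract_info(HTML)))
```

Because this walks the text nodes instead of indexing into contents, an extra blank line or an extra wrapper tag around the name should not break it, although any new text added inside the div would show up as an extra line.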
