Cleaning up and removing tags with BeautifulSoup
I have the following script so far:
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import re
import urllib2
br = Browser()
br.open("http://www.foo.com")
html = br.response().read()
soup = BeautifulSoup(html)
items = soup.findAll(id="info")
and it runs perfectly, and results in the following "items":
<div id="info">
<span class="customer"><b>John Doe</b></span><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>
However, I'd like to take items and clean it up to get:
John Doe
123 Main Street
5551234
How can you remove such tags in BeautifulSoup and Python?
As always, thanks!
Comments (1)
This will do it for this EXACT HTML. Obviously this isn't tolerant of any deviation, so you'll want to add quite a lot of bounds checking and null checking, but here are the nuts and bolts to get your data into plain text.
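A minimal sketch of what such an approach might look like. It assumes the `bs4` package (BeautifulSoup 4) rather than the BeautifulSoup 3 import used in the question. The function name `clean_info` and the `HTML` constant are hypothetical, introduced only for illustration. It is tied to the markup shown in the question: it drops the `paid` span, strips the literal `Phone:` prefix, and returns the remaining text fragments.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
HTML = """<div id="info">
<span class="customer"><b>John Doe</b></span><br>
123 Main Street<br>
Phone:5551234<br>
<b><span class="paid">YES</span></b>
</div>"""


def clean_info(html):
    """Return the text lines inside the element with id="info".

    Hypothetical helper: it assumes the layout from the question and
    only guards against the element being absent.
    """
    soup = BeautifulSoup(html, "html.parser")
    info = soup.find(id="info")
    if info is None:
        return []

    # Remove the "paid" flag so that "YES" does not appear in the output.
    for paid in info.find_all("span", class_="paid"):
        paid.decompose()

    lines = []
    # stripped_strings yields each text node with surrounding whitespace
    # removed and skips nodes that contain only whitespace.
    for text in info.stripped_strings:
        if text.startswith("Phone:"):
            text = text[len("Phone:"):].strip()
        lines.append(text)
    return lines
```

Calling `print("\n".join(clean_info(HTML)))` on the sample above should print the three lines John Doe, 123 Main Street and 5551234. Any change in the page layout (a missing span, a different phone label) would need extra checks, as the answer warns.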