Get the text content of tags using Python
Assume I have read HTML like this into my program:
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T & P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817775400.html">Optical Sales Position</a> - <font size="-1"> (New Westminster)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817709780.html">Sales Clerk</a> - <font size="-1"> (Kits)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817676850.html">MARINE SALES</a> - <font size="-1"> (VANCOUVER ( KITS ))</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817608506.html">Retail Sales Associate</a> - <font size="-1"> (Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817573985.html">Retail with small parts appliance background</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817540938.html">Manager *Enjoyable work atmosphere</a> - <font size="-1"> (Langley Centre)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/bnc/ret/1817403652.html">Team Member - Retail Store - FT</a> - <font size="-1"> (Burnaby South)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/rds/ret/1817459155.html">STORE MANAGER-SHOE WAREHOUSE</a> - <font size="-1"> (South Surrey-Semiahmoo)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/pml/ret/1817448777.html">Retail Sales</a> - <font size="-1"> (Coquitlam)</font></p>
How do I grab the contents of the text node? What I would like to end up with is printing something similar to this line in the terminal:
http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html - TRAVEL AGENT
So far I have the following code, which extracts the href links fine, but I'm not sure how to extract the data itself. I'm thinking of overriding handle_data(self, data) from the sgmllib.py module, but so far I can't think of a way to do it.
from sgmllib import SGMLParser

class URLLister(SGMLParser):
    def reset(self):
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        href = [v for k, v in attrs if k == "href"]
        if href:
            self.urls.extend(href)
Thanks!
Comments (5)
Simplest is probably BeautifulSoup (be sure to use 3.0.8 or a later 3.0.* release, not 3.1.*, unless you're on Python 3; see here!). BeautifulSoup produces unicode strings; if that's a problem, be sure to encode them as you wish to get the byte strings the way you want them!
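A minimal sketch of the BeautifulSoup approach follows. It assumes the newer `bs4` package (Beautiful Soup 4) on Python 3 rather than the 3.0.x release discussed above; the helper name `links_with_bs4` and the `SAMPLE` string are illustrative only.

```python
from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed

# Two rows copied from the question, used only as illustrative input.
SAMPLE = """\
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
"""


def links_with_bs4(markup):
    """Return a list of (href, link text) pairs, one per <a href=...> element."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        (a["href"], a.get_text().strip())
        for a in soup.find_all("a", href=True)  # skip anchors without an href
    ]


for href, text in links_with_bs4(SAMPLE):
    print(href, "-", text)
```

On Python 3 the extracted values are ordinary `str` objects, so the encoding step mentioned above only matters when byte strings are needed.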
Personally I would use lxml. Once installed, getting what you want is simple:
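One way this could look is sketched below. It assumes lxml is installed and uses its `lxml.html` module; the helper name `links_with_lxml` and the `SAMPLE` string are illustrative and not taken from the original answer.

```python
from lxml import html as lxml_html  # assumption: lxml is installed

# Two rows copied from the question, used only as illustrative input.
SAMPLE = """\
<p><a href="http://vancouver.en.craigslist.ca/van/ret/1817804151.html">IMMEDIATE EMPLOYMENT WANTED!</a> - </p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
"""


def links_with_lxml(markup):
    """Return a list of (href, link text) pairs, one per <a href=...> element."""
    doc = lxml_html.fromstring(markup)
    return [
        (a.get("href"), a.text_content().strip())
        for a in doc.xpath("//a[@href]")  # XPath: every <a> that has an href
    ]


for href, text in links_with_lxml(SAMPLE):
    print(href, "-", text)
```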
SGMLParser has been deprecated in Python 2.6, and will go away in 3.0. You probably want to use the HTMLParser module instead. I've never used it before (I always just use BeautifulSoup for this kind of thing), so I figured I'd learn how it works. Here's a sample script I put together that should get you what you want.
Output
Update: After going through that little exercise, the interface to this just feels gross, so I'm going to stick with the much cleaner BeautifulSoup library. See Alex's sample to see how it's done.
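For reference, here is a minimal standard-library sketch of the HTMLParser approach, written for Python 3, where the module lives at `html.parser`. It is not the script from this answer; the class name `LinkTextParser`, the helper `extract_links`, and the `SAMPLE` string are illustrative. The idea is the one raised in the question: override `handle_data`, but only keep text while the parser is inside an `<a>` element.

```python
from html.parser import HTMLParser

# Two rows copied from the question, used only as illustrative input.
SAMPLE = """\
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817849271.html">F/T & P/T Sales Associate - Caliente Fashions</a> - <font size="-1"> (North Vancouver)</font></p>
<p><a href="http://vancouver.en.craigslist.ca/nvn/ret/1817796152.html">TRAVEL AGENT</a> - <font size="-1"> (NORTH VANCOUVER)</font></p>
"""


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._chunks = []   # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Text can arrive in several pieces, so buffer it until </a>.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._chunks).strip()))
            self._href = None


def extract_links(markup):
    """Parse the markup and return the collected (href, text) pairs."""
    parser = LinkTextParser()
    parser.feed(markup)
    parser.close()
    return parser.links


for href, text in extract_links(SAMPLE):
    print(href, "-", text)
```

Tracking the open `<a>` in `_href` is what ties each piece of text to its link; without that state, `handle_data` would also receive the " - " separators and the `<font>` text.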
As long as we're comparing options, this pyparsing snippet also gives you the location for each position, given in the <font> tag following the closing </a> tag:
Gives: