How to get HTML elements using Python lxml
I have this HTML code:
<table>
<tr>
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test">bbb</td>
<td class="test">ccc</td>
<td class="test"><small>ddd</small></td>
</tr>
<tr>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test">fff</td>
<td class="test">ggg</td>
<td class="test"><small>hhh</small></td>
</tr>
</table>
I use this Python code to extract all <td class="test"> elements with the lxml module.
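from urllib.request import urlopen  # Python 3; the Python 2 equivalent is urllib2.urlopen
import lxml.html

# Download the page and parse it into an element tree.
code = urlopen("http://www.example.com/page.html").read()
html = lxml.html.fromstring(code)
# Select the first and fourth <td class="test"> of each row.
result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')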
It works well! The result is:
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test"><small>ddd</small></td>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test"><small>hhh</small></td>
(so the first and the fourth column of each <tr>)
Now, I have to extract:
aaa (the title of the link)
ddd (the text between the <small> tags)
eee (the title of the link)
hhh (the text between the <small> tags)
How could I extract these values?
(The problem is that I have to remove the <b> tag and get the title of the anchor in the first column, and remove the <small> tag in the fourth column.)
Thank you!
Comments (2)
If you call el.text_content() on each element, you'll strip all the tag markup and keep only the text.
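A minimal sketch of this approach, not the answerer's own code. It assumes the table from the question is parsed from an inline string instead of being downloaded; the names PAGE, cells, and values are introduced here for illustration only.

```python
import lxml.html

# Sample markup copied from the question (hypothetical stand-in for the real page).
PAGE = """
<table>
<tr>
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test">bbb</td>
<td class="test">ccc</td>
<td class="test"><small>ddd</small></td>
</tr>
<tr>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test">fff</td>
<td class="test">ggg</td>
<td class="test"><small>hhh</small></td>
</tr>
</table>
"""

html = lxml.html.fromstring(PAGE)

# Same XPath as in the question: first and fourth cell of every row.
cells = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')

# text_content() returns the text of the element and all its descendants,
# so the <b>, <a> and <small> wrappers disappear.
values = [el.text_content().strip() for el in cells]
print(values)
```

Because text_content() ignores which tags wrap the text, the same call works for both the link column and the <small> column.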
Why don't you just fetch what you want in each step?
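One way to read this suggestion, as a sketch rather than the answerer's own code: iterate over the rows and let XPath return the text nodes directly, so no tags need to be removed afterwards. The PAGE string and the names title, note, and pairs are assumptions made for this example.

```python
import lxml.html

# Sample markup copied from the question (hypothetical stand-in for the real page).
PAGE = """
<table>
<tr>
<td class="test"><b><a href="">aaa</a></b></td>
<td class="test">bbb</td>
<td class="test">ccc</td>
<td class="test"><small>ddd</small></td>
</tr>
<tr>
<td class="test"><b><a href="">eee</a></b></td>
<td class="test">fff</td>
<td class="test">ggg</td>
<td class="test"><small>hhh</small></td>
</tr>
</table>
"""

html = lxml.html.fromstring(PAGE)

pairs = []
for row in html.xpath('//tr'):
    # First column: the text of the link nested inside <b>.
    title = row.xpath('td[@class="test"][1]//a/text()')
    # Fourth column: the text inside <small>.
    note = row.xpath('td[@class="test"][4]/small/text()')
    # Skip rows that lack either piece instead of raising IndexError.
    if title and note:
        pairs.append((title[0], note[0]))
print(pairs)
```

Querying relative to each row keeps the link title and the <small> text paired per row, which a flat list of cells does not.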