How to get HTML elements with Python lxml

Posted 2024-08-31 16:05:57

I have this HTML code:

<table>
 <tr>
  <td class="test"><b><a href="">aaa</a></b></td>
  <td class="test">bbb</td>
  <td class="test">ccc</td>
  <td class="test"><small>ddd</small></td>
 </tr>
 <tr>
  <td class="test"><b><a href="">eee</a></b></td>
  <td class="test">fff</td>
  <td class="test">ggg</td>
  <td class="test"><small>hhh</small></td>
 </tr>
</table>

I use this Python code to extract all <td class="test"> elements with the lxml module.

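import urllib2
import lxml.html

# Fetch the page; the call has to go through urllib2, the module imported above
code   = urllib2.urlopen("http://www.example.com/page.html").read()
html   = lxml.html.fromstring(code)
# Keep only the first and fourth <td class="test"> of each row
result = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')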

It works well! The result is:

<td class="test"><b><a href="">aaa</a></b></td>
<td class="test"><small>ddd</small></td>


<td class="test"><b><a href="">eee</a></b></td>
<td class="test"><small>hhh</small></td>

(that is, the first and the fourth column of each <tr>).
Now I have to extract:

aaa (the title of the link)
ddd (the text inside the <small> tag)
eee (the title of the link)
hhh (the text inside the <small> tag)

How could I extract these values?

(The problem is that I have to strip the <b> tag and get the title (the link text) of the anchor in the first column, and strip the <small> tag in the fourth column.)

Thank you!

十六岁半 2024-09-07 16:05:57

If you call el.text_content(), you get each element's text with all the markup stripped, i.e.:

result = [el.text_content() for el in result]
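
To try this without fetching a page, here is a minimal sketch that runs the same XPath and text_content() call against the markup from the question. The SAMPLE string and the cells/texts variable names are made up for illustration, and it assumes lxml is installed.

```python
import lxml.html

# Sample markup copied from the question (SAMPLE is an illustrative name).
SAMPLE = """
<table>
 <tr>
  <td class="test"><b><a href="">aaa</a></b></td>
  <td class="test">bbb</td>
  <td class="test">ccc</td>
  <td class="test"><small>ddd</small></td>
 </tr>
 <tr>
  <td class="test"><b><a href="">eee</a></b></td>
  <td class="test">fff</td>
  <td class="test">ggg</td>
  <td class="test"><small>hhh</small></td>
 </tr>
</table>
"""

html = lxml.html.fromstring(SAMPLE)

# Same query as in the question: first and fourth td.test of each row.
cells = html.xpath('//td[@class="test"][position() = 1 or position() = 4]')

# text_content() joins all text inside the element, ignoring nested tags.
texts = [el.text_content() for el in cells]
print(texts)
```

Note that text_content() returns everything inside the cell, so a cell holding more than one text node would come back as a single concatenated string.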

木槿暧夏七纪年 2024-09-07 16:05:57

Why don't you just fetch what you want in each step?

# Link text from the first column, <small> text from the fourth
links = [el.text for el in html.xpath('//td[@class="test"][position() = 1]/b/a')]
smalls = [el.text for el in html.xpath('//td[@class="test"][position() = 4]/small')]
print(list(zip(links, smalls)))  # list() so Python 3 prints the pairs too
# => [('aaa', 'ddd'), ('eee', 'hhh')]
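
One thing to keep in mind with two separate queries is that zip pairs items purely by position in each list, so a row that lacks the link or the <small> tag would shift every later pair. A possible variation, sketched below, walks the rows and reads both cells from the same <tr>. It assumes each row of interest has at least four td.test cells; SAMPLE and pairs are illustrative names, not part of the original code.

```python
import lxml.html

# Compact copy of the markup from the question (SAMPLE is an illustrative name).
SAMPLE = (
    '<table>'
    '<tr><td class="test"><b><a href="">aaa</a></b></td>'
    '<td class="test">bbb</td><td class="test">ccc</td>'
    '<td class="test"><small>ddd</small></td></tr>'
    '<tr><td class="test"><b><a href="">eee</a></b></td>'
    '<td class="test">fff</td><td class="test">ggg</td>'
    '<td class="test"><small>hhh</small></td></tr>'
    '</table>'
)

html = lxml.html.fromstring(SAMPLE)

pairs = []
for row in html.xpath('//tr'):
    cells = row.xpath('./td[@class="test"]')
    if len(cells) < 4:
        continue  # assumption: rows without four matching cells are skipped
    # findtext() returns the text of the first matching descendant path,
    # or None when the path does not match.
    link_text = cells[0].findtext('b/a')
    small_text = cells[3].findtext('small')
    pairs.append((link_text, small_text))

print(pairs)
```

With this layout a missing <a> or <small> shows up as None in that row's tuple rather than silently misaligning the rows that follow.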
