Manipulating lists from lxml xpath queries
Today I tried lxml because I got very nasty HTML output from a particular web service and didn't want to use the re module, just for a change and to learn something new. I did so while browsing http://codespeak.net/lxml/ and http://stackoverflow.com in parallel.
I won't try to explain the HTML template; as an overview, it is full of deliberately nested tables.
I extracted the part of interest with the HTML parser, then used find_class() and iterated through the TRs with xpath (and even these TRs have tables inside).
Now I'm trying to extract data pairs based on class and id attributes:
- name child has id "title"
- value child has class "text"
Code looks something like this:
fragment = root.find_class('foo')
for node in fragment[0].xpath('table[2]/tr'):
    name = node.xpath('//div[@id="title"]')
    value = node.xpath('//td[@class="text"]')
The problem is that not every TR I'm iterating over has both items: some have only the name (id "title"), so later, when I try to zip the two lists, I get wrongly paired data.
I tried a couple of things that came to mind, but nothing worked: I compared the list lengths (for name and value) and skipped the name lookup if they didn't match, and I tried deleting the last list item when they didn't match (in many ways). For example:
if not len(name) == len(value):
    name.pop()
or
if len(name) == len(value):
    name = node.xpath('//div[@id="title"]')
    value = node.xpath('//td[@class="text"]')
Any comments from someone more experienced?
How's this?
Yielding:
Update: extended the leading portion of the XPath expression to eliminate an undesired result. Thanks to Alejandro for pointing this out and suggesting a fix, which didn't seem to work out for otrov.
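For illustration only, and not the answerer's original code: a minimal sketch of one way to keep names and values paired, by searching relative to each row and skipping rows that lack a value. The sample markup and the extract_pairs helper are hypothetical, and the sketch uses the standard library's xml.etree.ElementTree (which accepts a subset of XPath) so that it runs without extra packages. With lxml the same idea applies to node.xpath(): a path starting with '//' searches the whole document, while './/' searches only below the current node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; the real page layout is not shown in the thread.
SAMPLE = """
<div class="foo">
  <table>
    <tr><td><div id="title">Name A</div></td><td class="text">Value A</td></tr>
    <tr><td><div id="title">Name only</div></td></tr>
    <tr><td><div id="title">Name B</div></td><td class="text">Value B</td></tr>
  </table>
</div>
"""


def extract_pairs(root):
    """Return (name, value) tuples for rows that contain both elements."""
    pairs = []
    for row in root.findall(".//tr"):
        # The leading '.' keeps each search inside the current row.
        name = row.find(".//div[@id='title']")
        value = row.find(".//td[@class='text']")
        if name is None or value is None:
            continue  # skip rows that have a name but no value
        pairs.append((name.text.strip(), value.text.strip()))
    return pairs


root = ET.fromstring(SAMPLE.strip())
pairs = extract_pairs(root)
print(pairs)
```

Because each pair is built inside the loop, there is no need to zip two separately collected lists afterwards, so a row with a missing value cannot shift the pairing.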
Now, with the input sample, it is clearer what you are asking.
Just this one XPath 1.0 expression returns a node set with the div and td pairs (in document order):
As proof, this stylesheet:
Output (with a corrected input sample, because yours is missing a closing td tag):