Parsing HTML with lxml

Posted 2024-10-31 15:57:18

I have the following HTML:

<table class="results">
  <tr>
    <td>
      <a href="..">link</a><span>2nd Mar 2011</span><br>XYZ Consultancy Ltd<br>
       <div>....</div>
    </td>
  </tr>
</table>

I am using lxml with Python to parse the HTML above. I want to retrieve "XYZ Consultancy Ltd", but I can't work out how to do it. My code so far:

import lxml.html

# root is assumed to be the parsed document, e.g. lxml.html.fromstring(html)
for table in root.cssselect("table.results"):
    for tr in table:
        for td in tr:
            for child in td:
                if child.tag == 'a':
                    print("keyword:", child.text_content())
                if child.tag == 'span':
                    print("date:", child.text_content())


在风中等你 2024-11-07 15:57:18

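You can use the CSS selector +, the adjacent sibling combinator, to reach the <br> that precedes the text. The target text is then held in that element's tail attribute.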

import lxml.html
root = lxml.html.fromstring('''
<table class="results">
  <tr>
    <td>
      <a href="..">link</a><span>2nd Mar 2011</span><br>XYZ Consultancy Ltd<br>
       <div>....</div>
    </td>
  </tr>
</table>
''')
# Select the <br> that directly follows the <a> and <span> inside the cell.
for br_with_tail in root.cssselect('table.results > tr > td > a + span + br'):
    print(br_with_tail.tail)
    # => XYZ Consultancy Ltd
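
If it is unclear why the company name ends up in tail rather than text, the sketch below may help. It is only an illustration under two assumptions that are not in the answer: it uses the standard library's xml.etree.ElementTree (lxml follows the same text/tail model), and it uses a well-formed copy of the cell with self-closed <br/> tags so the XML parser accepts it.

```python
import xml.etree.ElementTree as ET

# Assumption: a well-formed copy of the question's <td>, with <br/> self-closed
# so the standard-library XML parser can read it.
td = ET.fromstring(
    '<td><a href="..">link</a><span>2nd Mar 2011</span>'
    '<br/>XYZ Consultancy Ltd<br/><div>....</div></td>'
)

# .text is the text inside an element, before its first child.
# .tail is the text after the element's end, up to the next sibling.
for child in td:
    print(child.tag, repr(child.text), repr(child.tail))

# The company name is not inside any element; it trails the first <br>.
first_br = td.find('br')
company = first_br.tail
print(company)
```

Text that sits between two tags always belongs to the tail of the element before it, which is why iterating over the child elements alone never shows it.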


影子的影子 2024-11-07 15:57:18

One way to do this is to use XPath to find each such a node and check that the next two elements are a span and a br. If they are, read the tail attribute of the br element:

from lxml import etree

data = '''<table class="results">
  <tr>
    <td>
      <a href="..">link</a><span>2nd Mar 2011</span><br>XYZ Consultancy Ltd<br>
       <div>....</div>
    </td>
  </tr>
</table>'''

root = etree.HTML(data)

for a in root.xpath('//table[@class="results"]/tr/td/a'):
    # The <a> must be followed directly by a <span> ...
    span = a.getnext()
    if span is None or span.tag != 'span':
        continue
    # ... and the <span> directly by a <br>.
    br = span.getnext()
    if br is None or br.tag != 'br':
        continue
    print('tag:', a.text)
    print('date:', span.text)
    # The company name is the text trailing the <br>, i.e. its tail.
    print('company:', br.tail)
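
As a possible variation on the same idea, XPath can also select the text node itself, so no tail lookup is needed. This is a sketch rather than part of the answer above: company_names is a hypothetical helper name, and the expression assumes the company name is always the first text node after the first <br> in the cell, without checking for the preceding a and span.

```python
from lxml import etree

data = '''<table class="results">
  <tr>
    <td>
      <a href="..">link</a><span>2nd Mar 2011</span><br>XYZ Consultancy Ltd<br>
       <div>....</div>
    </td>
  </tr>
</table>'''

root = etree.HTML(data)


def company_names(tree):
    """Hypothetical helper: return the text that follows the first <br> of each cell."""
    texts = tree.xpath(
        '//table[@class="results"]/tr/td/br[1]/following-sibling::text()[1]'
    )
    # Strip surrounding whitespace and drop nodes that are whitespace only.
    return [t.strip() for t in texts if t.strip()]


names = company_names(root)
print(names)
```

The sibling-walking loop is stricter, since it skips cells that do not match the a, span, br layout; the XPath form is shorter but trusts the markup more.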
