Parsing HTML with lxml

Posted 2024-09-15 15:53:10 · 459 words · 5 views · 0 comments

I need help extracting some text from a page with lxml. I tried BeautifulSoup first, but the HTML of the page I am parsing is so broken that it wouldn't work. So I have moved on to lxml, but the docs are a little confusing and I was hoping someone here could help me.

Here is the page I am trying to parse. I need to get the text under the "Additional Info" section. Note that I have a lot of pages like this on the site to parse, and each page's HTML is not always exactly the same (it might contain some extra empty "td" tags). Any suggestions on how to get at that text would be very much appreciated.

Thanks for the help.

妄司 2024-09-22 15:53:10
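import lxml.html as lh
from urllib.request import urlopen  # Python 3 replacement for urllib2

def text_tail(node):
    # Text inside the node, then the text that follows its closing tag.
    yield node.text
    yield node.tail

url='http://bit.ly/bf1T12'
doc=lh.parse(urlopen(url))
blurb=[]  # stays empty if the label is never found
for elt in doc.iter('td'):
    text=elt.text_content()
    if text.startswith('Additional  Info'):
        # Walk every later <td> sibling, skipping empty and &nbsp;-only text.
        blurb=[text for node in elt.itersiblings('td')
               for subnode in node.iter()
               for text in text_tail(subnode) if text and text!='\xa0']
        break
print('\n'.join(blurb))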

yields

For over 65 years, Carl Stirn's Marine
has been setting new standards of
excellence and service for boating
enjoyment. Because we offer quality
merchandise, caring, conscientious,
sales and service, we have been able
to make our customers our good
friends.

Our 26,000 sq. ft. facility includes a
complete parts and accessories
department, full service department
(Merc. Premier dealer with 2 full time
Mercruiser Master Tech's), and new,
used, and brokerage sales.
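
Since the short link above may no longer resolve, here is a minimal offline sketch of the same text/tail walk. The SAMPLE markup is hypothetical and only imitates the layout described in the question (a label cell, an extra empty cell, then the blurb), and blurb_after is an illustrative helper name, not part of lxml.

```python
import lxml.html as lh

# Hypothetical markup standing in for the real page: a label cell, an
# extra &nbsp;-only cell, and a blurb cell whose lines are split by <br>.
SAMPLE = """
<table><tr>
<td><b>Additional  Info</b></td>
<td>&nbsp;</td>
<td>First line<br>Second line<br>Third line</td>
</tr></table>
"""

def text_and_tail(node):
    """Yield the text inside a node, then the text that follows it."""
    yield node.text
    yield node.tail

def blurb_after(root, label):
    """Collect non-blank text chunks from the <td> siblings after `label`."""
    for cell in root.iter('td'):
        if cell.text_content().startswith(label):
            return [chunk.strip()
                    for sibling in cell.itersiblings('td')
                    for sub in sibling.iter()
                    for chunk in text_and_tail(sub)
                    if chunk and chunk.strip()]
    return []  # label not found on this page

root = lh.fromstring(SAMPLE)
print('\n'.join(blurb_after(root, 'Additional  Info')))
```

The text after each <br> is stored as that element's tail, which is why the walk has to read both text and tail. Filtering on chunk.strip() also drops the &nbsp;-only cell and whitespace between cells, so extra empty "td" tags do not show up in the result.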

Edit: Here is an alternate solution based on Steven D. Majewski's xpath which addresses the OP's comment that the number of tags separating 'Additional Info' from the blurb can be unknown:

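import lxml.html as lh
from urllib.request import urlopen  # Python 3 replacement for urllib2

url='http://bit.ly/bf1T12'
doc=lh.parse(urlopen(url))

# Find the <td> whose child element holds the label, then take the
# text nodes of every later <td> sibling.
blurb=doc.xpath('//td[child::*[text()="Additional  Info"]]/following-sibling::td/text()')

# Drop text nodes that are just a non-breaking space.
blurb=[text for text in blurb if text != '\xa0']
print('\n'.join(blurb))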
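
The XPath approach can be tried offline in the same way. The sketch below is a variation, not the expression from the answer: it assumes the label cell contains only ordinary spaces, so normalize-space() can match "Additional Info" regardless of how many there are, and it uses //text() so that text nested inside child elements is picked up too. SAMPLE and blurb_by_xpath are hypothetical names for illustration.

```python
import lxml.html as lh

# Hypothetical markup: label cell, two empty cells, then a blurb cell
# whose second line is wrapped in a child element.
SAMPLE = """
<html><body><table><tr>
<td><b>Additional  Info</b></td>
<td>&nbsp;</td>
<td></td>
<td>First line<br><span>Second line</span></td>
</tr></table></body></html>
"""

# All text nodes at any depth under the later <td> siblings.
NESTED_TEXT_XPATH = ('//td[normalize-space(.)="Additional Info"]'
                     '/following-sibling::td//text()')

# Only text nodes that are direct children of those <td> siblings.
DIRECT_TEXT_XPATH = ('//td[normalize-space(.)="Additional Info"]'
                     '/following-sibling::td/text()')

def blurb_by_xpath(root, expression=NESTED_TEXT_XPATH):
    """Run the XPath and keep only the non-blank text nodes, stripped."""
    return [text.strip() for text in root.xpath(expression) if text.strip()]

root = lh.document_fromstring(SAMPLE)
print(blurb_by_xpath(root))
print(blurb_by_xpath(root, DIRECT_TEXT_XPATH))
```

With this sample, the //text() form returns both lines while the /text() form misses the one inside the span, so which to use depends on how the blurb cells are marked up on the real pages.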