Need help with python lxml syntax to parse html

Posted 2024-07-14 19:48:43

I am brand new to python, and I need some help with the syntax for finding and iterating through html tags using lxml. Here are the use-cases I am dealing with:

The HTML file is fairly well formed (but not perfect). It has multiple tables, one containing a set of search results and one each for a header and footer. Each result row contains a link to the search result detail.

  1. I need to find the middle table with the search result rows (this one I was able to figure out):

        self.mySearchTables = self.mySearchTree.findall(".//table")
        self.myResultRows = self.mySearchTables[1].findall(".//tr")
    
  2. I need to find the links contained in this table (this is where I'm getting stuck):

        for searchRow in self.myResultRows:
            searchLink = patentRow.findall(".//a")
    

    It doesn't seem to actually locate the link elements.

  3. I need the plain text of the link. I imagine it would be something like searchLink.text if I actually got the link elements in the first place.

Finally, in the actual API reference for lxml, I wasn't able to find information on the find and the findall calls. I gleaned these from bits of code I found on google. Am I missing something about how to effectively find and iterate over HTML tags using lxml?
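
Putting the three steps together, a minimal sketch using only find() and findall() might look like this (the [1] index for the results table follows the setup described above; the file name and variable names are placeholders):

from lxml import html

tree = html.parse("results.html").getroot()   # or html.fromstring(html_text)

tables = tree.findall(".//table")             # every <table> in the document
result_rows = tables[1].findall(".//tr")      # rows of the second (results) table

for row in result_rows:
    # findall() returns a list of elements, so iterate over it even when
    # each row is expected to contain only one link
    for link in row.findall(".//a"):
        print(link.text, link.get("href"))    # link text and href attribute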

Comments (2)

逆夏时光 2024-07-21 19:48:43

Okay, first, regarding parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the version of BeautifulSoup included with lxml. That way you will also reap the benefit of a nice XPath or CSS selector interface.

However, I personally prefer Ian Bicking's HTML parser included in lxml.

Secondly, .find() and .findall() come from lxml trying to be compatible with ElementTree, and those two methods are described in XPath Support in ElementTree.

Those two functions are fairly easy to use, but they support only a very limited subset of XPath. I recommend trying either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method.

Here are some examples, with an HTML string parsed like this:

from lxml.html import fromstring
mySearchTree = fromstring(your_input_string)
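
To make the limitation concrete, here is the question's table lookup written both ways once mySearchTree exists; the contains() predicate in the last line is purely illustrative:

# find()/findall() understand only ElementTree's limited path syntax
# (tag paths, wildcards, simple [@attr] and position predicates)
rows = mySearchTree.findall(".//table")[1].findall(".//tr")

# xpath() accepts full XPath 1.0: positional predicates, functions, text() tests
rows = mySearchTree.xpath("(.//table)[2]//tr")
detail_links = mySearchTree.xpath(".//a[contains(text(), 'Detail')]")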

Using the cssselect() method, your program would roughly look something like this:

# Find all 'a' elements inside 'tr' table rows with a CSS selector
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))

The equivalent using the xpath() method would be:

# Find all 'a' elements inside 'tr' table rows with XPath
for a in mySearchTree.xpath('.//tr/*/a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
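
One note on step 3 of the question: a.text returns only the text that sits directly inside the <a> element before its first child tag, so a link containing nested markup needs text_content() to collect all of the descendant text. A small sketch with a made-up snippet:

from lxml.html import fromstring

snippet = fromstring(
    '<table><tr><td><a href="/detail/1">view <b>details</b></a></td></tr></table>'
)
link = snippet.cssselect('tr a')[0]
print(link.text)            # 'view '        (text before the first child tag only)
print(link.text_content())  # 'view details' (all descendant text joined)
print(link.get('href'))     # '/detail/1'
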
还在原地等你 2024-07-21 19:48:43

Is there a reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.
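
A minimal sketch of the same three steps with Beautiful Soup, assuming the modern bs4 package (the file name and the [1] table index are placeholders mirroring the question's setup):

from bs4 import BeautifulSoup

with open("results.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

tables = soup.find_all("table")              # all tables in the document
for row in tables[1].find_all("tr"):         # rows of the second (results) table
    for link in row.find_all("a"):
        print(link.get_text(strip=True), link.get("href"))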
