Need help with python lxml syntax for parsing html
I am brand new to python, and I need some help with the syntax for finding and iterating through html tags using lxml. Here are the use-cases I am dealing with:
The HTML file is fairly well formed (but not perfect). It has multiple tables on screen: one containing a set of search results, and one each for a header and footer. Each result row contains a link to the search result detail.
I need to find the middle table with the search result rows (this one I was able to figure out):
self.mySearchTables = self.mySearchTree.findall(".//table")
self.myResultRows = self.mySearchTables[1].findall(".//tr")
I need to find the links contained in this table (this is where I'm getting stuck):
for searchRow in self.myResultRows:
    searchLink = patentRow.findall(".//a")
It doesn't seem to actually locate the link elements.
I need the plain text of the link. I imagine it would be something like searchLink.text, if I actually got the link elements in the first place.
Finally, in the actual API reference for lxml, I wasn't able to find information on the find and the findall calls. I gleaned these from bits of code I found on google. Am I missing something about how to effectively find and iterate over HTML tags using lxml?
2 Answers
Okay, first, regarding parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the version of BeautifulSoup included with lxml. That way you will also reap the benefit of a nice xpath or css selector interface.
However, I personally prefer Ian Bicking's HTML parser included in lxml.
Secondly, .find() and .findall() come from lxml's effort to be compatible with ElementTree, and those two methods are described in XPath Support in ElementTree. Those two functions are fairly easy to use, but they support only a very limited subset of XPath. I recommend trying either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method. Here are some examples, with an HTML string parsed like this:
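(The snippet that originally sat here did not survive the copy; this is a minimal sketch with invented markup. The page layout, link text, and hrefs are all hypothetical, chosen to mirror the question's header/results/footer table structure.)

```python
from lxml.html import fromstring

# Hypothetical search-results page: a header table, a results table
# whose rows each carry a link, and a footer table.
html = """
<html><body>
  <table><tr><td>Header</td></tr></table>
  <table>
    <tr><td><a href="/patent/1">Widget frobnicator</a></td></tr>
    <tr><td><a href="/patent/2">Gadget polisher</a></td></tr>
  </table>
  <table><tr><td>Footer</td></tr></table>
</body></html>
"""

# fromstring() parses the string and returns the root element.
mySearchTree = fromstring(html)
```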
Using the css selector class your program would roughly look something like this:
The equivalent using the xpath method would be:
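(The original example is missing here too; a sketch with the same invented markup. With full XPath you can even pull the text and attributes out directly, without touching the elements themselves.)

```python
from lxml.html import fromstring

# Invented single-row results table, for illustration only.
html = ('<table><tr><td><a href="/patent/1">Widget frobnicator</a>'
        '</td></tr></table>')
mySearchTree = fromstring(html)

# text() selects the link text nodes; @href selects the attribute
# values. Both come back as plain Python strings in a list.
titles = mySearchTree.xpath('.//tr//a/text()')
hrefs = mySearchTree.xpath('.//tr//a/@href')
```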
Is there a reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.
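(For comparison, a minimal sketch of the same link extraction with Beautiful Soup; the markup is invented, and this assumes the bs4 package with its built-in html.parser backend.)

```python
from bs4 import BeautifulSoup

# Invented single-row results table, for illustration only.
html = ('<table><tr><td><a href="/patent/1">Widget frobnicator</a>'
        '</td></tr></table>')
soup = BeautifulSoup(html, 'html.parser')

# find_all('a') returns every link tag; get_text() gives the plain
# text and subscript access reads the href attribute.
links = [(a.get_text(), a['href']) for a in soup.find_all('a')]
```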