Parsing HTML with lxml
I need help parsing some text out of a page with lxml. I tried BeautifulSoup first, but the HTML of the page I am parsing is so broken that it wouldn't work. So I have moved on to lxml, but the docs are a little confusing, and I was hoping someone here could help me.
Here is the page I am trying to parse. I need to get the text under the "Additional Info" section. Note that I have a lot of pages like this on the site to parse, and each page's HTML is not always exactly the same (it might contain some extra empty "td" tags). Any suggestions on how to get at that text would be very much appreciated.
Thanks for the help.
Comments (1)
yields
Edit: Here is an alternate solution, based on Steven D. Majewski's XPath, that addresses the OP's comment that the number of tags separating 'Additional Info' from the blurb can be unknown:
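The idea can be sketched roughly as follows. This is only an illustration, not the original answer's code or Steven D. Majewski's exact XPath: the sample markup, the `additional_info` helper, and the XPath expression are all assumptions, since the real page is not shown here. The sketch finds the `td` that contains "Additional Info", then takes the first following `td` that has any non-whitespace text, so any number of empty cells in between is skipped.

```python
import lxml.html

# Hypothetical markup standing in for the real page (assumption):
# a heading cell, some empty "td" cells, then the blurb.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><td><b>Additional Info</b></td></tr>
  <tr><td></td></tr>
  <tr><td>   </td></tr>
  <tr><td>Blurb text goes here.</td></tr>
  <tr><td>Unrelated footer</td></tr>
</table>
</body></html>
"""


def additional_info(html):
    """Return the text of the first non-empty td after 'Additional Info'."""
    doc = lxml.html.fromstring(html)
    cells = doc.xpath(
        # td whose text contains the section heading ...
        '//td[contains(., "Additional Info")]'
        # ... then the first later td that is not empty or whitespace-only
        '/following::td[normalize-space()][1]'
    )
    return cells[0].text_content().strip() if cells else None


if __name__ == "__main__":
    print(additional_info(SAMPLE_HTML))
```

If the real pages nest tables, the `contains(., ...)` test may also match an outer `td`, so the expression would likely need tightening against the actual markup.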