What is the most forgiving HTML parser in Python?
I have some random HTML and I use BeautifulSoup to parse it, but in most cases (>70%) it chokes. I tried Beautiful Soup 3.0.8 and 3.2.0 (there were some problems with 3.1.0 and above), but the results are nearly the same.
Off the top of my head, I can recall several HTML parser options available in Python:
- BeautifulSoup
- lxml
- pyquery
I intend to test all of these, but I wanted to know which one turned out to be the most forgiving in your tests and can even attempt to parse bad HTML.
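A minimal side-by-side sketch of that kind of test (the sample snippet is illustrative; it assumes lxml, pyquery, and BeautifulSoup 4 are installed):

```python
import lxml.html
from bs4 import BeautifulSoup
from pyquery import PyQuery

# Feed the same malformed snippet to each candidate and compare what survives.
broken = "<html><body><p>unclosed paragraph<div>stray <b>tag</body>"

print(BeautifulSoup(broken, "html.parser").get_text())  # BeautifulSoup 4, stdlib parser
print(lxml.html.fromstring(broken).text_content())       # lxml's HTML parser
print(PyQuery(broken).text())                            # pyquery (lxml underneath)
```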
Comments (4)
They all are. I have yet to come across any html page found in the wild that lxml.html couldn't parse. If lxml barfs on the pages you're trying to parse, you can always preprocess them using some regexps to keep lxml happy.

lxml itself is fairly strict, but lxml.html is a different parser and can deal with very broken html. For extremely broken html, lxml also ships with lxml.html.soupparser, which interfaces with the BeautifulSoup library. Some approaches to parsing broken html using lxml.html are described here: http://lxml.de/elementsoup.html
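A minimal sketch of the two entry points mentioned above, assuming lxml (and BeautifulSoup, which the soupparser fallback needs) are installed:

```python
import lxml.html
from lxml.html import soupparser

broken = "<p>some <b>badly <i>nested</b> markup</i>"

# lxml.html already tolerates a lot of malformed markup.
doc = lxml.html.fromstring(broken)
print(lxml.html.tostring(doc))

# For the truly hopeless cases, fall back to the BeautifulSoup-backed parser.
doc = soupparser.fromstring(broken)
print(lxml.html.tostring(doc))
```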
With pages that don't work with anything else (those that contain nested <form> elements come to mind) I've had success with MinimalSoup and ICantBelieveItsBeautifulSoup. Each can handle certain types of error that the other one can't, so often you'll need to try both.
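A sketch of that try-both approach with the BeautifulSoup 3 classes named above (BeautifulSoup here is the old 3.x, Python 2 era module, and the <form> check is just an illustrative way of deciding whether a result is usable):

```python
from BeautifulSoup import MinimalSoup, ICantBelieveItsBeautifulSoup

def parse_stubborn_html(markup):
    # Try the variant with the fewest nesting heuristics first, then the other one.
    for parser_class in (MinimalSoup, ICantBelieveItsBeautifulSoup):
        soup = parser_class(markup)
        if soup.find('form') is not None:  # did this variant recover the part we care about?
            return soup
    return None
```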
I ended up using BeautifulSoup 4.0 with html5lib for parsing, which is much more forgiving. With some modifications to my code it's now working considerably well. Thanks all for the suggestions.
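For reference, the BeautifulSoup 4 plus html5lib combination boils down to something like this (assumes the beautifulsoup4 and html5lib packages are installed):

```python
from bs4 import BeautifulSoup

broken = "<table><tr><td>cell with no closing tags"
soup = BeautifulSoup(broken, "html5lib")  # html5lib rebuilds the tree the way a browser would
print(soup.prettify())
```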
If BeautifulSoup doesn't fix your html problem, the next best solution would be regular expressions. lxml, ElementTree, and minidom are very strict in their parsing, and actually they are doing the right thing.

Other tips:

I feed the html to the lynx browser through the command line, take out the text version of the page/content, and parse it using regex.

Converting html to text or html to markdown strips all the html tags, and you are left with just the text. That is easy to parse.
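A rough sketch of that lynx-plus-regex pipeline (assumes the lynx browser is on PATH and that its -dump/-stdin options behave as on typical installs; the price pattern is only an illustration):

```python
import re
import subprocess

html = "<html><body><p>Price: $19.99</p><p>Price: $5.00</body>"

# Ask lynx to render the markup and dump the plain-text result.
text = subprocess.run(
    ["lynx", "-dump", "-stdin"],
    input=html, capture_output=True, text=True,
).stdout

# Once it is plain text, a simple regex is enough.
print(re.findall(r"\$\d+\.\d{2}", text))
```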