Need help parsing html in python3; it isn't well-formed enough for xml.etree.ElementTree
I keep getting mismatched tag errors all over the place. I'm not sure why exactly; it's the craigslist homepage, whose text looks fine to me, but I haven't skimmed it thoroughly enough. Is there perhaps something more forgiving I could use, or is this my best bet for HTML parsing with the standard library?
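For illustration, a minimal sketch of the failure mode (with made-up markup, not the actual craigslist source): xml.etree.ElementTree expects well-formed XML, so a void element like <br> that is never closed makes the following </p> look mismatched.

```python
import xml.etree.ElementTree as ET

# Typical real-world HTML: <br> is legal HTML but not well-formed XML.
page = "<html><body><p>hello<br></p></body></html>"

try:
    ET.fromstring(page)
except ET.ParseError as exc:
    # Prints something like: ParseError: mismatched tag: line 1, column 29
    print("ParseError:", exc)
```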
The mismatched tag errors are likely caused by mismatched tags. Browsers are famous for accepting sloppy html, and have made it easy for web page coders to write badly formed html, so there's a lot of it out there. There's no reason to believe that craigslist should be immune to bad web page designers.
You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know it.)
One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on that.
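A sketch of what the Tidy route could look like, assuming the third-party pytidylib wrapper (and the underlying HTML Tidy library it binds to) is installed:

```python
import xml.etree.ElementTree as ET
from tidylib import tidy_document  # third-party wrapper around HTML Tidy

messy = "<html><body><p>hello<br></body></html>"

# Ask Tidy for XHTML output and numeric entities so the result is valid XML.
cleaned, errors = tidy_document(
    messy,
    options={"output-xhtml": 1, "numeric-entities": 1},
)

# The cleaned-up document now parses with the standard library.
root = ET.fromstring(cleaned)
print(root.tag)  # namespaced XHTML root, e.g. {http://www.w3.org/1999/xhtml}html
```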
The best library for parsing unpredictable HTML is BeautifulSoup; the project page pitches it specifically at messy, real-world markup.
However, it isn't well-supported for Python 3; there's more information about this at the end of the project page.
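If a release that runs on your interpreter is available (the later BeautifulSoup 4 series, imported as bs4, added proper Python 3 support), typical usage looks roughly like this:

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup  # BeautifulSoup 4; older releases import differently

page = urlopen("https://www.craigslist.org/").read()
soup = BeautifulSoup(page, "html.parser")  # lenient parser, no extra dependencies

# Pull out every link target, even from sloppy markup.
for a in soup.find_all("a"):
    print(a.get("href"))
```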
Parsing HTML is not an easy problem, and using a library is definitely the solution here. The two common libraries for parsing HTML that isn't well formed are BeautifulSoup and lxml.
lxml supports Python 3, and its HTML parser handles unpredictable HTML well. It's excellent and fast, as it uses C libraries under the hood. I highly recommend it.
BeautifulSoup 3.1 supports Python 3, but it is also deemed "a failed experiment" and you are told not to use it, so in practice BeautifulSoup doesn't support Python 3 yet, leaving lxml as the only alternative.
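A rough sketch of the lxml route, assuming lxml is installed: lxml.html uses libxml2's forgiving HTML parser, so mismatched and unclosed tags are repaired instead of raising errors.

```python
from urllib.request import urlopen

from lxml import html

page = urlopen("https://www.craigslist.org/").read()
tree = html.fromstring(page)  # broken markup is repaired, not rejected

# XPath queries work on the repaired tree.
for href in tree.xpath("//a/@href"):
    print(href)
```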