Strange behavior of html.parser

Posted on 2025-01-06 12:02:04

Using Python 3.2, I attempted the example straight from the html.parser documentation:

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data  :", data)

parser = MyHTMLParser(strict=False)
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Instead of the result shown in the documentation, I get:

Encountered some data  : <html>
Encountered some data  : <head>
Encountered some data  : <title>
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered some data  : <body>
Encountered some data  : <h1>
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

For some reason, it treats the start tags as data, but only if strict=False. With strict=True I get the correct result:

Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data  : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data  : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html

Comments (1)

香草可樂 2025-01-13 12:02:04

This was a bug that has since been fixed (http://bugs.python.org/issue13273). In fact, the history at http://hg.python.org/cpython/log/9ce5d456138b/Lib/html/parser.py contains a whole lot of log messages about problems with strict=False; it almost feels like this mode should still be considered beta.

If you take the most recent version of the file (http://hg.python.org/cpython/raw-file/9ce5d456138b/Lib/html/parser.py) and use that, at least the example from the documentation works again. Still, personally I would be a bit wary of trusting strict=False to work in "critical applications" at the moment.
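
To check whether a given interpreter still shows the symptom from the question, a small self-test along these lines may help. It is only a sketch: the names EventCollector, collect_events and start_tags_misparsed are made up for illustration, and it assumes Python 3.5 or later, where the strict argument no longer exists and the parser always runs in the tolerant mode. On Python 3.2 you would have to pass strict=False to the constructor to exercise the same code path.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parser events instead of printing them."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def collect_events(markup):
    # Assumption: Python 3.5+, so no `strict` argument is passed here.
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


def start_tags_misparsed(events):
    # The symptom from the question: start tags arrive as data such as "<html>".
    return any(kind == "data" and value.startswith("<")
               for kind, value in events)


DOC = ('<html><head><title>Test</title></head>'
       '<body><h1>Parse me!</h1></body></html>')

events = collect_events(DOC)
print("start tags reported as data:", start_tags_misparsed(events))
```

Collecting events in a list instead of printing them makes the check easy to turn into an assertion, which is useful if you depend on the tolerant mode and want to notice a regression early.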
