HTML 是一种上下文无关语言吗？

发布于 2024-10-20 14:06:03 字数 805 浏览 11 评论 0原文

阅读一些相关问题让我思考了 HTML 的理论本质。

我在这里讨论的不是类似 XHTML 的代码。我正在谈论像这段疯狂的标记这样的东西，它是完全有效的 HTML（！）

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html<head>
<title//
<p ltr<span id=p></span</p>
</>

因此，考虑到 SGML 在这里注入的巨大复杂性，HTML 是一种上下文无关语言吗？无论如何，它是一种正式语言吗？有语法吗？

HTML5 怎么样？

<子> 我对形式语言的概念很陌生，所以请耐心等待。是的，我已经阅读了维基百科文章；）

原文

Reading some related questions made me think about the theoretical nature of HTML.

I'm not talking about XHTML-like code here. I'm talking about stuff like this crazy piece of markup, which is perfectly valid HTML(!)

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html<head>
<title//
<p ltr<span id=p></span</p>
</>

So given the enormous complexity that SGML injects here, is HTML a context-free language? Is it a formal language anyway? With a grammar?

What about HTML5?

_{I'm new to the concept of formal languages, so please bear with me. And yes, I have read the wikipedia article ;)}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

落在眉间の轻吻 2024-10-27 14:06:03

上下文无关是语言理论中的一个概念，对解析器的实现具有重要意义。 上下文无关语言可以通过上下文无关语法来描述，其中所有规则在箭头左侧都有一个非终结符：

X→δ

这个简单限制允许 X 被规则的右侧替换，其中出现在左侧，而不考虑之前或之后的内容。例如，如果在推导或解析时得出：

αXλ

则确定这

αδλ

也是有效的。非上下文无关规则的示例如下：

XY→δ
Xa→δ
aX→δ

那些需要知道围绕 X 可以导出什么来确定规则是否适用，这会导致不确定性（X< 周围是什么） /code> 还想知道它派生的是什么），这在解析中是禁忌，无论如何我们都希望语言能够被明确定义。

证明一种语言是上下文无关的唯一方法是证明它存在上下文无关语法，这不是一件容易的事。大多数出现的编程语言都已经由 CFG 进行了描述，所以工作就完成了。但还有其他语言，包括编程语言，是使用逻辑或简单英语进行描述的，因此需要进行工作来确定它们是否是上下文无关的。

对于 HTML 来说，其上下文无关性的答案是肯定的。 SGML 是一种定义良好的上下文无关语言，定义在其之上的 HTML 也是一种 CFL。这两种语言的解析器和语法在网络上比比皆是。无论如何，存在有效的 LL(k) 语法 HTML 足以证明该语言是上下文无关的，因为 LL 是 CF 的经过验证的子集。

但 HTML 在 Web 生命周期中的演变方式迫使浏览器认为它的定义并不明确。现代网络浏览器会不遗余力地尝试从他们发现的几乎所有内容中呈现出合理的内容。他们使用的语法不是 CFG，并且解析器比 SGML/HTML 所需的解析器复杂得多。

HTML 是在多个级别上定义的。

在词汇层面，有有效字符、标识符、字符串等的规则。
下一个级别是 XML，它由定义分层文档结构的开始和结束组成。您可以将 XML 或类似 XML 的内容用于任何目的，就像 Apache Ant 用于构建脚本一样。
下一个级别是 HTML 中有效的标签，以及有关哪些标签可以嵌套在哪些标签中的规则。
下一个级别是有关哪些属性对哪些标签有效的规则，以及可以嵌入 HTML 中的语言（如 CSS 和 JavaScript）。
最后，您拥有有关给定 HTML 文档含义的语义规则。

语法部分定义得足够好，可以验证。语义部分比语法部分大得多，并且是根据有关 HTTP 的浏览器操作以及文档来定义的对象模型 (DOM)，以及如何将模型渲染到屏幕上。

最后：

解析正确的 HTML 非常容易（它是上下文无关的并且 LL/LR）。
解析 Web 上实际存在的 HTML 很困难。
通过 HTML/CSS/DOM 实现语义（浏览器）非常困难。

Context Free is a concept from language theory that has important implications in parser implementation. A Context Free Language can be described by a Context Free Grammar, which is one in which all rules have a single non-terminal symbol at the left of the arrow:

X→δ

That simple restriction allows X to be substituted by the right-hand side of the rules in which appears on the left without regard to what came before or after. For example, if while deriving or parsing one arrives at:

αXλ

one is sure that

αδλ

is also valid. Examples of non-context-free rules would be:

XY→δ
Xa→δ
aX→δ

Those would require knowing what could be derive arround X to determine if a rule applies, and that leads to non-determinism (what's around X would also like to know what it derives to), which is a no-no in parsing, and in any case we want a language to be well-defined.

The only way to prove that a language is context-free is by proving that there's a context-free grammar for it, which is not an easy task. Most programming languages one comes about are already described by CFGs, so the job is done. But there are other languages, including programming languages, that are described using logic or plain English, so work is required to find if they are context-free.

For HTML, the answer about its context-freedom is yes. SGML is a well defined Context Free Language, and HTML defined on top of it is also a CFL. Parsers and grammars for both languages abound on the Web. At any rate, that there exist LL(k) grammars for valid HTML is enough proof that the language is context-free, because LL is a proven subset of CF.

But the way HTML evolved over the life of the Web forced browsers to treat it as not that well defined. Modern Web browsers will go out of their way to try to render something sensible out of almost anything they find. The grammars they use are not CFGs, and the parsers are far more complex than the ones required for SGML/HTML.

HTML is defined at several levels.

At the lexical level there are the rules for valid characters, identifiers, strings, and so on.
At the next level is XML, which consists of the opening and closing <tags> that define a hierarchical document structure. You can use XML or something XML-like for any purpose, like Apache Ant does for build scripts.
At the next level are the tags that are valid in HTML, and the rules about which tags may be nested within which tags.
At the next level are the rules about which attributes are valid for which tags, languages that can be embedded in HTML like CSS and JavaScript.
Finally, you have the semantic rules about what a given HTML document means.

The syntactic part is defined well enough that it can be verified. The semantic part is much larger than the syntactic one, and is defined in terms of browser actions regarding HTTP, and the Document Object Model (DOM), and how a model should be rendered to the screen.

In the end:

Parsing correct HTML is extremely easy (it's context-free and LL/LR).
Parsing the HTML that actually exists over the Web is difficult.
Implementing the semantics (a browser) over HTML/CSS/DOM is extremely difficult.

回复收藏 0 原文

掩于岁月 2024-10-27 14:06:03

有效的 HTML 不是上下文无关的语言。

首先，HTML 作为 SGML 的应用程序对于所有实际目的来说都是虚构的，因此分析 SGML 来回答这个问题是没有用的。（但是，SGML 小说可能也不是上下文无关的。）

查看实际定义的 HTML 解析算法更有用。它在两个层面上工作：标记化和树构建。 HTML 所谓的标记化是比解析器中通常所说的标记化更高级别的操作。就 HTML 而言，标记化将字符流拆分为开始标记、结束标记、注释和文本等单元。分词器扩展了字符引用。通常，在谈论解析器时，您可能会将小于号之类的东西视为“标记”，并且会认为字符引用由标记组成，而不是由标记生成器解析。

如果您考虑将输入流拆分为标记的过程，则 HTML 语言的该级别是常规的（除了来自树构建器的反馈）。

然而，存在三个复杂性：第一个是将输入流分割成令牌只是第一个，然后树构建器方面实际上关心令牌中的标识符。第二个是树构建器反馈到分词器中，以便分词器所做的一些状态转换取决于树构建器的状态！第三个是语言中的有效文档是由适用于树构建器阶段的输出的规则定义的，这些规则足够复杂，无法使用树自动机完全定义（如 RELAX NG 不具有表达能力所证明）足以描述所有有效性约束）。

这不是一个实际的证明，但您可能可以通过复杂性#2 和#3 来开发真正的证明。

请注意，无效文档的情况并不是特别有趣，因为语言是否是上下文无关的问题，因为存在上下文无关语法，可以生成所有可能的字符串，而不考虑解析树是否具有某种可理解的解释就 HTML 解析器生成的树而言。 HTML 解析器将成功消耗所有可能的字符串，因此从这个意义上说，所有可能的字符串都采用“无效 HTML”语言。

编辑：留给读者练习的有趣问题：

没有解析错误但忽略有效性的 HTML 是上下文无关语言吗？

没有解析错误且忽略一般有效性但仅具有有效元素名称的 HTML 是否允许上下文无关语言？

（并发症 #2 在这两种情况下都适用。）

Valid HTML is not a context-free language.

First of all, HTML being an application of SGML is fiction for all practical purposes, so analyzing SGML to answer the question is useless. (However, the SGML fiction probably isn't context-free, either.)

It's more useful to look at the actually defined HTML parsing algorithm. It works on two levels: tokenization and tree building. What HTML calls tokenization is a higher-level operation than what is usually called tokenization when talking about parsers. In the case of HTML, tokenization splits a stream of characters into units like start tags, end tags, comments and text. The tokenizer expands character references. Usually, when talking about parsers, you'd probably treat stuff like the less-than sign as "tokens" and would consider character references to consist of tokens instead of being resolved by the tokenizer.

If you consider the process of splitting the input stream into tokens, that level of the HTML language is regular (except for feedback from the tree builder).

However, there are three complications: The first one is that splitting the input stream into tokens is just the first and then there's the tree builder's side that actually cares about the identifiers in the tokens. The second one is that the tree builder feeds back into the tokenizer so that some state transitions made by the tokenizer depend on the state of the tree builder! The third one is that valid documents in the language are defined by rules that apply to the output of the tree builder stage and those rules are complex enough that they can't be fully defined using tree automata (as evidenced by RELAX NG not being expressive enough to describe all the validity constraints).

This isn't an actual proof, but you can probably develop real proofs by working from complications #2 and #3.

Note that the case of invalid documents is not particularly interesting as a question of whether the language is context-free in the sense of there being a context-free grammar that generates all the possible strings with no regard to the parse tree having some intelligible interpretation in terms of the tree that an HTML parser generates. The HTML parser will successfully consume all possible strings, so in that sense, all possible strings are in the "invalid HTML" language.

Edit: Interesting questions left as exercise to the reader:

Is HTML without parse errors but ignoring validity a context-free language?

Is HTML without parse errors and ignoring general validity but with only valid element names allowed a context-free language?

(Complication #2 applies in both cases.)

回复收藏 0 原文

优雅的叶子 2024-10-27 14:06:03

否

请参阅下面的编辑

~~这取决于。~~

~~如果您正在谈论仅由理论上的 HTML 组成的子集，那么是。~~

如果您还包括现实生活中的工作 HTML，每天有数百万人在互联网上的许多顶级网站上成功访问和使用该 HTML，那么否< /强>。

这就是 HTML 的灵活性。解析引擎添加标签、关闭标签，并处理理论上 CFG 无法完成的事情。如果您学过自动机，您可能会记得，形式语法中的产生式规则的 lhs（左侧）不能为空（又名 epsilon/lambda）。由于解析引擎基本上使用形式语法和自动机无法拥有的知识，因此它不受此限制，并且“语法”将具有 epsilon/lambda -> 。结果，其中基于语法中不可用的信息选择特定的 epsilon/lambda 规则。

由于我认为任何形式语法中都不允许空 lh，因此 HTML 不能由形式语法定义，并且根本不是形式语言。

当然，HTML5 可能会尝试转向“更正式”的语言描述，但实际上它成为上下文无关语言（即拒绝与语法不匹配的字符串）的可能性与 XHTML 的可能性有关2.0 席卷了整个世界，并完全取代了 HTML（XHTML 是他们为使 HTML 成为一种正式语言所做的尝试……由于其脆弱性而被集体拒绝）。

值得注意的是，HTML 5 是第一个在实施之前定义的 HTML 标准！没错，HTML 1-4 由某人刚刚在浏览器中实现的随机想法组成，并根据哪些功能被广泛使用和广泛实现而被收集到标准中。然后他们尝试了XHTML，但完全没有被采用。甚至网络上的“xhtml”在几乎所有情况下都会自动解析为 HTML，以防止内容因神秘的语法错误而中断。现在您可以了解我们是如何走到这一步的，以及为什么它不太可能很快正式化。

教训：“理论上，理论和实践没有区别。在实践中，是有区别的。” - Yogi Berra

编辑：

实际上，在通读文档后发现，即使根据 HTML 4.01 规范，HTML 实际上也不符合 SGML。要亲自查看，请查看 HTML 4.01 严格文档类型定义 (doctype)，网址为 http:// www.w3.org/TR/html4/strict.dtd 并注意以下几行：

HTML 4.01 规范包括附加内容
无法表达的句法约束
DTD。

因此，我想说，由于这些功能，它可能不是 CFL（尽管从技术上讲，它并不能反驳存在某些可能接受 HTML 4.01 的 PDA 的假设，但它确实阻止了这样的论点： SGML 是一种 CFL，因此 HTML 是一种 CFL）。

HTML5 触发器，放弃任何隐含的 SGML 一致性，但可能可以通过 CFG 进行描述。然而，它仍然会提供不基于 cfg 的尽力解析，因此 IMO 目前的情况（即语言规范已正式定义，无效字符串仍以尽力而为的方式被接受、解析和呈现）在这方面不太可能在很长、很长、很长一段时间内发生巨大的变化。

NO

See Edit Below

~~It depends.~~

~~If you are talking about the subset consisting of only theoretical HTML, then yes.~~

If you also include real life, working HTML that is accessed and used successfully by millions of people daily on many of the top sites on the internet then NO.

That is what gives HTML flexibility. The parsing engine adds tags, closes tags, and takes care of stuff that a theoretical CFG can't do. If you took automata you might remember that a production rule in a formal grammar cannot be empty (aka epsilon/lambda) on the lhs (left-hand side). Since the parsing engine is basically using knowledge that a formal grammar and automata couldn't have, it isn't restricted by that and the 'grammar' would have epsilon/lambda -> result where the specific epsilon/lambda rule is chosen based on information not available in the grammar.

Since I don't think empty lhs are allowed in any formal grammars, HTML cannot be defined by a formal grammar and is not a formal language at all.

Sure, HTML5 might try to move towards a 'more formal' language description but the likelihood that it becomes a context free language in reality (i.e. strings not matched by the grammar are rejected) is about the likelihood XHTML 2.0 takes the world by storm and replaces HTML altogether (XHTML is the attempt they made to make HTML a formal language...it was rejected en masse due to its fragility).

Noteworthy is the fact that HTML 5 is the FIRST HTML standard to be defined before being implemented! That's right, HTML 1-4 consist of random ideas someone just implemented in a browser, and were collected into standards after the fact based on which features were popularly used and widely implemented. Then they tried XHTML, which totally failed to be adopted. Even 'xhtml' on the web is automatically parsed as HTML under almost every circumstance to prevent stuff from just breaking with a cryptic syntax error. Now you can see how we got here and why it is unlikely to be formalized any time soon.

Lesson: "In theory, there is no difference between theory and practice. In practice, there is." - Yogi Berra

EDIT:

Actually, after reading through the documents it turns out that HTML, even according to the HTML 4.01 specification, doesn't actually conform to SGML. To see for yourself, view the HTML 4.01 Strict document type definition (doctype) at http://www.w3.org/TR/html4/strict.dtd and note the following lines:

The HTML 4.01 specification includes additional
syntactic constraints that cannot be expressed within
the DTDs.

So I would say that it is probably not a CFL due to those features (although it technically it doesn't disprove the hypothesis that there is some possible PDA that accepts HTML 4.01, it does prevent the argument that SGML is a CFL therefore HTML is a CFL).

HTML5 flip-flops, abandoning any implied conformance to SGML, but is presumably describable by a CFG. However it will still provide best-effort parsing not based on a cfg, so IMO the current situation (i.e. language specification is defined formally, with invalid strings still being accepted, parsed and rendered in a best effort fashion) in this regard is unlikely to change drastically for a long, long, long time.

回复收藏 0 原文