For argument's sake, let's assume an HTML parser.
I've read that it tokenizes everything first, and then parses it.
What does tokenize mean?
Does the parser read each character, building up a multidimensional array to store the structure?
For example, does it read a <, begin capturing the element, and then, once it meets a closing > (outside of an attribute), push it onto an array stack somewhere?
I'm interested for the sake of knowing (I'm curious).
If I were to read through the source of something like HTML Purifier, would that give me a good idea of how HTML is parsed?
Tokenizing can be composed of a few steps. For example, given an HTML document as input, the tokenizer may first convert the string into a flat list of significant tokens, discarding whitespace (thanks, SasQ, for the correction). There may be multiple tokenizing passes that turn one list of tokens into a list of even higher-level tokens, as a hypothetical HTML parser might do (the result is still a flat list).
The parser then converts that list of tokens into a tree or graph that represents the source text in a way that is more convenient for the program to access and manipulate.
At this point, parsing is complete; it is then up to the user to interpret the tree, modify it, and so on. A rough sketch of these stages follows below.
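As a hedged illustration of those stages (a minimal sketch with invented function and token names, not code from the original answer), a regex-based tokenizer producing a flat token list, followed by a stack-based tree builder, might look like this:

```python
import re

def tokenize(html):
    """Flat pass: split markup into (kind, text) tokens, discarding blank text.

    Here each whole tag is already emitted as a single token, i.e. the
    'higher-level tokens' pass described above is folded into this one.
    """
    token_spec = [
        ("OPEN_TAG",  r"<[A-Za-z][^>]*>"),   # e.g. <p> or <p style="special">
        ("CLOSE_TAG", r"</[A-Za-z][^>]*>"),  # e.g. </p>
        ("TEXT",      r"[^<]+"),             # anything up to the next tag
    ]
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in token_spec)
    tokens = []
    for match in re.finditer(pattern, html):
        kind, text = match.lastgroup, match.group().strip()
        if kind == "TEXT" and not text:      # discard whitespace-only text
            continue
        tokens.append((kind, text))
    return tokens

def parse(tokens):
    """Turn the flat token list into a nested tree, using a stack of open elements."""
    root = {"tag": "#document", "children": []}
    stack = [root]
    for kind, text in tokens:
        if kind == "OPEN_TAG":
            node = {"tag": text.strip("<>").split()[0], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)               # descend into the new element
        elif kind == "CLOSE_TAG":
            stack.pop()                      # back up to the parent element
        else:                                # TEXT
            stack[-1]["children"].append(text)
    return root

tokens = tokenize("<html><body><p>Hello</p><p>World</p></body></html>")
tree = parse(tokens)
# `tokens` is a flat list; `tree` nests the <p> elements inside <body> inside <html>.
```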
First of all, you should be aware that parsing HTML is particularly ugly -- HTML was in wide (and divergent) use before being standardized. This leads to all manner of ugliness, such as the standard specifying that some constructs aren't allowed, but then specifying required behavior for those constructs anyway.
Getting to your direct question: tokenization is roughly equivalent to taking English and breaking it up into words. In English, most words are consecutive streams of letters, possibly including an apostrophe, hyphen, etc. Mostly, words are surrounded by spaces, but a period, question mark, exclamation point, etc., can also signal the end of a word. Likewise, for HTML (or whatever else) you specify some rules about what can make up a token (word) in that language. The piece of code that breaks the input up into tokens is normally known as the lexer.
At least in a normal case, you do not break all the input up into tokens before you start parsing. Rather, the parser calls the lexer to get the next token when it needs one. When it's called, the lexer looks at enough of the input to find one token, delivers that to the parser, and no more of the input is tokenized until the next time the parser needs more input.
In a general way, you're right about how a parser works: (at least in a typical parser) it does use a stack during the act of parsing a statement, but what it builds to represent a statement is normally a tree (an abstract syntax tree, aka AST), not a multidimensional array.
Given the complexity of parsing HTML, I'd hold off on looking at an HTML parser until you've read through a few others first. If you do some looking around, you should be able to find a fair number of parsers/lexers for things like mathematical expressions that are probably more suitable as an introduction (smaller, simpler, easier to understand, etc.).
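As a hedged sketch of that kind of introductory expression parser (the grammar, names, and token format are invented for illustration, not taken from any particular library), the parser below pulls one token at a time from the lexer and builds a small AST; the "stack" here is simply the call stack of the recursive-descent routines:

```python
import re

TOKEN_RE = re.compile(r"\s*(?:(\d+)|(\S))")  # a number, or a single operator character

def lexer(text):
    """Yield one token at a time; the parser pulls tokens only as it needs them."""
    for number, op in TOKEN_RE.findall(text):
        yield ("NUM", int(number)) if number else ("OP", op)
    yield ("EOF", None)

class Parser:
    """Tiny recursive-descent parser for '+' and '*' expressions, building an AST."""

    def __init__(self, text):
        self.tokens = lexer(text)
        self.current = next(self.tokens)     # one token of lookahead

    def advance(self):
        self.current = next(self.tokens)

    def parse_expr(self):                    # expr := term ('+' term)*
        node = self.parse_term()
        while self.current == ("OP", "+"):
            self.advance()
            node = ("+", node, self.parse_term())
        return node

    def parse_term(self):                    # term := factor ('*' factor)*
        node = self.parse_factor()
        while self.current == ("OP", "*"):
            self.advance()
            node = ("*", node, self.parse_factor())
        return node

    def parse_factor(self):                  # factor := NUM
        kind, value = self.current
        assert kind == "NUM", f"expected a number, got {value!r}"
        self.advance()
        return value

ast = Parser("1 + 2 * 3").parse_expr()
# ast == ('+', 1, ('*', 2, 3)) -- a tree in which '*' binds tighter than '+'
```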
Don't miss the W3C's notes on parsing HTML5.
For an interesting introduction to scanning/lexing, search the web for Efficient Generation of Table-Driven Scanners. It shows how scanning is ultimately driven by automata theory. A collection of regular expressions is transformed into a single NFA. The NFA is then transformed to a DFA to make state transitions deterministic. The paper then describes a method to transform the DFA into a transition table.
A key point: scanners use regular expression theory but likely don't use existing regular expression libraries. For better performance, state transitions are coded as giant case statements or in transition tables.
Scanners guarantee that correct words (tokens) are used. Parsers guarantee the words are used in the correct combination and order. Scanners use regular expression and automata theory. Parsers use grammar theory, especially context-free grammars.
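As a hedged sketch of the transition-table idea (the states, character classes, and table below are invented for illustration, not taken from the paper), a hand-rolled DFA that scans one kind of token might look like this:

```python
# A DFA encoded as a transition table, recognizing unsigned integers (one "word"
# a larger scanner might need). A scanner generator would derive such a table
# from regular expressions via the NFA -> DFA construction described in the paper.
START, IN_NUMBER, ERROR = 0, 1, 2
ACCEPTING = {IN_NUMBER}

def char_class(ch):
    return "digit" if ch.isdigit() else "other"

TRANSITIONS = {
    (START, "digit"):     IN_NUMBER,
    (START, "other"):     ERROR,
    (IN_NUMBER, "digit"): IN_NUMBER,
    (IN_NUMBER, "other"): ERROR,
}

def scan_number(text, pos):
    """Run the DFA from `pos` and return the longest accepted lexeme (or None)."""
    state, last_accept = START, None
    for i in range(pos, len(text)):
        state = TRANSITIONS[(state, char_class(text[i]))]
        if state == ERROR:
            break
        if state in ACCEPTING:
            last_accept = i + 1
    return text[pos:last_accept] if last_accept else None

print(scan_number("42abc", 0))  # -> '42'
```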
A couple of parsing resources:
HTML and XML syntax (and others based on SGML) are quite hard to parse, and they don't fit well into the lexing scenario, because they're not regular. In parsing theory, a regular grammar is one that doesn't have any recursion, that is, no self-similar, nested patterns or parentheses-like wrappers which have to match each other. But HTML/XML/SGML-based languages do have nested patterns: tags can be nested. Syntax with nesting patterns is higher in Chomsky's classification: it's context-free or even context-sensitive.
But back to your question about the lexer:
Each syntax consists of two kinds of symbols: non-terminal symbols (those which unwind into other syntax rules) and terminal symbols (those which are "atomic" - they are leaves of the syntax tree and don't unwind into anything else). Terminal symbols are often just the tokens. Tokens are pumped one by one from the lexer and matched to their corresponding terminal symbols.
Those terminal symbols (tokens) often have a regular syntax, which is easier to recognize (and that's why recognition is factored out into the lexer, which is more specialized for regular grammars and can do the job more quickly than the more general machinery needed for non-regular grammars).
So, to write a lexer for an HTML/XML/SGML-like language, you need to find parts of the syntax which are atomic enough and regular, so that the lexer can deal with them easily. And here the problem arises, because it's not at first obvious which parts these are. I struggled with this problem for a long time.
But Lie Ryan above has done a very good job of recognizing these parts. Bravo to him for that! The token types are as follows:
- The < lexeme, used for starting tags.
- The > lexeme, used for ending tags.
- The / lexeme, used in closing tags.
- The = lexeme, used for separating attribute names from their values.
- The ' lexeme, used for enclosing attribute values.
- The " lexeme, used for enclosing attribute values.
- The PlainText token, for any text not containing the < character directly and not covered by the above types.
You can also have some tokens for entity references, like &nbsp; or &amp;. Probably: a lexeme consisting of & followed by some alphanumeric characters and ended with ;.
Why did I use separate tokens for ' and " rather than one token for an attribute value? Because a regular syntax couldn't recognize which of these characters should end the sequence - it depends on the character which started it (the ending character has to match the starting character). This "parenthesizing" is considered non-regular syntax, so I promote it to a higher level - to the parser. It would be the parser's job to match these tokens (starting and ending) together (or none at all, for simple attribute values not containing spaces).
Afterthought:
Unfortunately, some of these tokens may occur only inside other markup. So the use of lexical contexts is needed, which after all is another state machine controlling the state machines recognizing particular tokens. And that's why I've said that SGML-like languages don't fit well into the schema of lexical analysis.
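A hedged sketch of a lexer built around roughly those token kinds (the regular expression and names below are my own invention, not code from the answer); note that it simply emits the quote characters as tokens and leaves the matching of opening and closing quotes to the parser, as described above:

```python
import re

# One token per match: the single-character lexemes listed above, plus ENTITY and
# PLAINTEXT. Pairing a ' with a ' (and a " with a ") is left to the parser, since
# that "parenthesizing" is not regular.
LEXEME_RE = re.compile(r"""
      (?P<LT>        <                )
    | (?P<GT>        >                )
    | (?P<SLASH>     /                )
    | (?P<EQUALS>    =                )
    | (?P<SQUOTE>    '                )
    | (?P<DQUOTE>    "                )
    | (?P<ENTITY>    &[A-Za-z0-9]+;   )
    | (?P<PLAINTEXT> [^<>/='"&]+      )
    | (?P<AMP>       &                )   # a lone & that does not form an entity
""", re.VERBOSE)

def lex(markup):
    return [(m.lastgroup, m.group()) for m in LEXEME_RE.finditer(markup)]

for kind, text in lex('<p class="intro">Tom &amp; Jerry</p>'):
    print(kind, repr(text))
# LT '<', PLAINTEXT 'p class', EQUALS '=', DQUOTE '"', PLAINTEXT 'intro', DQUOTE '"', ...
```

Handling the afterthought's point about lexical contexts (for example, that = only matters inside a tag) would require switching between per-context token sets, which this sketch deliberately does not attempt.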
This is how the HTML5 parser works: the parsing-model diagram in the HTML5 specification shows bytes from the network being decoded into characters, preprocessed into the input stream, consumed by the tokenizer, and then handed to the tree-construction stage, which builds the DOM (with script execution able to feed text back into the input stream via document.write()).