I'm reading about compilers and parsers architecture now and I wonder about one thing...
When you have XML, XHTML, HTML or any SGML-based language, what would be the role of a lexer here and what would be the tokens?
I've read that tokens are like words prepared by the lexer for parsing. Although I have no problem finding tokens for languages like C, C++, Pascal etc., where there are keywords, names, literals and other word-like strings separated by whitespace, with XML I have a problem, because there aren't any words! It's only plain text interleaved with the markup (tags).
I thought to myself that it could be that these tags and plain text fragments are the tokens, something like that: [TXT][TAG][TAG][TXT][TAG][TXT][TAG][TAG][TXT]... It would be quite reasonable, since SGML doesn't care what's inside the markup delimiters `<` and `>` (well, it recognizes special processing instructions and definitions when it finds `?` or `!` as the next character; comments belong to that group too), and the SGML tokenizer could be a base for the XML/HTML/XHTML parser.
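That [TXT]/[TAG] idea can be sketched in a few lines of Python (my own toy illustration, not code from any real parser): everything from `<` to the next `>` becomes a TAG token, everything in between becomes a TXT token. It deliberately ignores the attribute-value problem discussed below.

```python
import re

def coarse_tokenize(s):
    """Coarse SGML-ish tokenizer: alternating TAG and TXT tokens."""
    tokens = []
    # Either a whole '<...>' run, or a maximal run of non-'<' characters.
    for m in re.finditer(r'<[^>]*>|[^<]+', s):
        text = m.group()
        kind = 'TAG' if text.startswith('<') else 'TXT'
        tokens.append((kind, text))
    return tokens

print(coarse_tokenize('<p>Hello <b>world</b>!</p>'))
# → [('TAG', '<p>'), ('TXT', 'Hello '), ('TAG', '<b>'), ('TXT', 'world'),
#    ('TAG', '</b>'), ('TXT', '!'), ('TAG', '</p>')]
```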
But then I realized that there can be `<` characters stuffed inside the markup as a part of other syntax: attribute values :-/ Even if it's not quite a good idea to put `<` characters inside attribute values (it's better to use `&lt;` for that), many browsers and editors deal with that and treat these `<` characters as a part of the attribute value, not as a tag delimiter.
It complicates things a bit, because I don't see a way to recognize markup like that with a simple deterministic finite automaton (DFA) in the lexer. It looks like the automaton needs one context when it's inside a tag, and another context when it encounters an attribute value. I think this would need a stack of states/contexts, so a DFA might not handle it. Am I right?
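For what it's worth, quoted attribute values don't actually force a pushdown automaton: quotes can't nest, so a fixed set of scanner modes (TEXT, TAG, ATTR_VALUE) suffices, which is still equivalent to a DFA. A minimal hand-written sketch in Python, assuming attribute values are quoted:

```python
def tokenize(s):
    """Mode-based scanner: a '>' (or '<') inside a quoted attribute
    value does not terminate the tag. Three fixed modes, no stack."""
    tokens = []
    i, start, mode, quote = 0, 0, 'TEXT', ''
    while i < len(s):
        c = s[i]
        if mode == 'TEXT':
            if c == '<':
                if i > start:
                    tokens.append(('TXT', s[start:i]))
                start, mode = i, 'TAG'
        elif mode == 'TAG':
            if c in '"\'':
                mode, quote = 'ATTR_VALUE', c
            elif c == '>':
                tokens.append(('TAG', s[start:i + 1]))
                start, mode = i + 1, 'TEXT'
        elif mode == 'ATTR_VALUE':
            if c == quote:       # only the matching quote ends this mode
                mode = 'TAG'
        i += 1
    if mode == 'TEXT' and i > start:
        tokens.append(('TXT', s[start:i]))
    return tokens

print(tokenize('<a href="x>y">link</a>'))
# → [('TAG', '<a href="x>y">'), ('TXT', 'link'), ('TAG', '</a>')]
```

Unquoted stray `<` in attribute values (the browser-tolerance case) is messier, but it still only adds states, not a stack; what genuinely needs a stack is matching open tags with close tags, and that's the parser's job, not the lexer's.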
What's your view? Is it good to make tokens from tags (markup) and plain text?
Here: http://www.antlr.org/wiki/display/ANTLR3/Parsing+XML
a somewhat different technique is used: they treat `<` and `>` (and also `</` and `/>`) as separate tokens, and inside the tags they use `GENERIC_ID` as a token, etc. They generally move most of the work to the parser. But they also have to change contexts for the tokenizer: they use one context in the plain text and a different one in markup (but I think they forgot about an attribute-value context, because the first occurrence of `>` will end the tag in their lexer).
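To make that token-level split concrete, here is a rough Python imitation of the approach (the token names below are my own, not the ANTLR grammar's), including the attribute-value handling needed so a `>` inside quotes doesn't end the tag:

```python
import re

# Token patterns used only inside markup; order matters ('</' before '<').
TAG_TOKENS = [
    ('TAG_CLOSE_START', r'</'),
    ('TAG_SELF_END',    r'/>'),
    ('TAG_START',       r'<'),
    ('TAG_END',         r'>'),
    ('ATTR_VALUE',      r'"[^"]*"|\'[^\']*\''),
    ('EQUALS',          r'='),
    ('GENERIC_ID',      r'[^\s<>="\'/]+'),
    ('WS',              r'\s+'),
]
TAG_RE = re.compile('|'.join('(?P<%s>%s)' % p for p in TAG_TOKENS))

def fine_tokenize(s):
    """Two lexer contexts: plain text (PCDATA) and markup."""
    tokens, i, in_markup = [], 0, False
    while i < len(s):
        if not in_markup:
            j = s.find('<', i)
            if j == -1:
                j = len(s)
            if j > i:
                tokens.append(('PCDATA', s[i:j]))
            i, in_markup = j, True
        else:
            m = TAG_RE.match(s, i)
            if m is None:
                break                      # malformed input; bail out
            if m.lastgroup != 'WS':
                tokens.append((m.lastgroup, m.group()))
            if m.lastgroup in ('TAG_END', 'TAG_SELF_END'):
                in_markup = False          # back to plain-text context
            i = m.end()
    return tokens

print(fine_tokenize('<a href="x">hi</a>'))
```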
So what's the best approach for parsing SGML-like languages? Is the lexer really used there? If yes, what strings constitute the tokens?
Having built XML and HTML parsers, I have opinions.
Lexemes in general should be recognizable language elements.
For XML and HTML, these basically correspond to the tag starts and ends (TAGSTART, TAGEND), attribute names and attribute values, entity invocations, and the runs of text between the tags (CONTENT).
If you want to keep the entity invocations as separate tokens, you could do that, producing streams of CONTENT and ENTITYINVOCATION tokens between TAGENDs and TAGSTARTs; depends on what your goal is.
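That CONTENT/ENTITYINVOCATION split can be sketched like so (a toy Python illustration; the token names follow the answer, the entity regex is my own simplification):

```python
import re

ENTITY = r'&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+);'

def split_content(text):
    """Split inter-tag text into CONTENT and ENTITYINVOCATION tokens."""
    tokens = []
    # Either a well-formed entity, a run of non-'&' text, or a stray '&'.
    for m in re.finditer(ENTITY + r'|[^&]+|&', text):
        piece = m.group()
        kind = 'ENTITYINVOCATION' if re.fullmatch(ENTITY, piece) else 'CONTENT'
        tokens.append((kind, piece))
    return tokens

print(split_content('fish &amp; chips'))
# → [('CONTENT', 'fish '), ('ENTITYINVOCATION', '&amp;'), ('CONTENT', ' chips')]
```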
We can argue about whether you want to produce a token for HTML/XML comments or not. If you do, you do.
If we ignore the complications of DTDs and Schemas for XML, that's all you really need.
How the lexer produces these is more complicated; with XML and HTML, there's a lot of messiness having to do with escapes in the input stream, and `<![CDATA[ ... ]]>`, which is just a funny kind of quote and vanishes when the CONTENT lexeme is produced. To handle all this, you need a pretty sophisticated lexer engine.
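The "funny kind of quote" behavior can be shown in miniature (a toy sketch of my own; a real lexer would do this while producing CONTENT lexemes, not as a separate pass): the `<![CDATA[ ... ]]>` wrapper disappears, and whatever was inside becomes ordinary content with no markup recognized in it.

```python
import re

def strip_cdata(s):
    """Replace each CDATA section with its raw contents."""
    return re.sub(r'<!\[CDATA\[(.*?)\]\]>', r'\1', s, flags=re.S)

print(strip_cdata('a <![CDATA[<not-a-tag>]]> b'))
# → a <not-a-tag> b
```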
And yes, as a practical matter, you need different lexical states ("modes") to process different parts of the text. I pretty much have one major mode to process things inside `<...>`, and one major mode to process CONTENT.