Building a Lexical Analyzer in Java
I am presently learning lexical analysis in compiler design. In order to learn how a lexical analyzer really works, I am trying to build one myself. I am planning to build it in Java.
The input to the lexical analyzer is a .tex file which is of the following format.
\begin{document}
\chapter{Introduction}
\section{Scope}
Arbitrary text.
\section{Relevance}
Arbitrary text.
\subsection{Advantages}
Arbitrary text.
\subsubsection{In Real life}
\subsection{Disadvantages}
\end{document}
The output of the lexer should be a table of contents, possibly with page numbers, written to another file.
1. Introduction 1
1.1 Scope 1
1.2 Relevance 2
1.2.1 Advantages 2
1.2.1.1 In Real Life 2
1.2.2 Disadvantages 3
I hope that this problem is within the scope of lexical analysis.
My lexer would read the .tex file and check for '\'; on finding one, it continues reading to check whether it is indeed one of the sectioning commands. A flag variable is set to indicate the type of sectioning. The word in curly braces following the sectioning command is read and written out, prefixed with a number (like 1.2.1) depending upon the type and depth.
I hope the above approach would work for building the lexer. How do I go about adding page numbers to the table of contents, if that's possible within the scope of the lexer?
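Roughly, I imagine the scanning step looking something like this. This is just a sketch of the approach I described above; the class and method names are placeholders, and it assumes each sectioning command sits on a single line with its braces:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: scan each line for a sectioning command and emit a numbered
// table-of-contents entry. Names and structure are illustrative only.
public class TocLexer {
    // counters[0] = chapter, [1] = section, [2] = subsection, [3] = subsubsection
    private final int[] counters = new int[4];

    // Ordered from shallowest to deepest nesting depth.
    private static final String[] COMMANDS = {
        "\\chapter{", "\\section{", "\\subsection{", "\\subsubsection{"
    };

    public List<String> scan(List<String> lines) {
        List<String> toc = new ArrayList<>();
        for (String line : lines) {
            for (int depth = 0; depth < COMMANDS.length; depth++) {
                int start = line.indexOf(COMMANDS[depth]);
                if (start < 0) continue;
                int open = start + COMMANDS[depth].length();
                int close = line.indexOf('}', open);
                if (close < 0) continue;
                String title = line.substring(open, close);
                counters[depth]++;                  // bump this level
                for (int d = depth + 1; d < counters.length; d++) {
                    counters[d] = 0;                // reset deeper levels
                }
                String num = number(depth);
                if (depth == 0) num = num + ".";    // chapters print as "1." in the sample
                toc.add(num + " " + title);
                break;
            }
        }
        return toc;
    }

    // Builds a dotted number like "1.2.1" from the counters up to this depth.
    private String number(int depth) {
        StringBuilder sb = new StringBuilder();
        for (int d = 0; d <= depth; d++) {
            if (d > 0) sb.append('.');
            sb.append(counters[d]);
        }
        return sb.toString();
    }
}
```

On the sample input this produces entries like `1. Introduction`, `1.1 Scope`, and `1.2.1.1 In Real life`, but it hasn't been tested against real-world LaTeX.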
2 Answers
You really could add them any way you want. I would recommend storing the contents of your .tex file in your own tree-like or map-like structure, then read in your page numbers file, and apply them appropriately.
A more archaic option would be to write a second parser that parses the output of your first parser and the line numbers file and appends them appropriately.
It really is up to you. Since this is a learning exercise, try to build it as if someone else were going to use it. How user-friendly is it? Making something only you can use is still good for learning the concepts, but could lead to messy practices if you ever do this in the real world!
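As a sketch of the map-like idea, assuming the page numbers file simply lists one page per heading in document order (all names here are illustrative, not a prescribed design):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: pair each ToC entry with a page number read from a second source.
// Assumes one page number per entry, in the same order as the entries.
public class PageNumberMerger {
    // tocEntries: e.g. "1.1 Scope"; pageNumbers: the page each entry starts on
    public static Map<String, Integer> merge(List<String> tocEntries,
                                             List<Integer> pageNumbers) {
        // LinkedHashMap preserves the document order of the headings.
        Map<String, Integer> result = new LinkedHashMap<>();
        for (int i = 0; i < tocEntries.size(); i++) {
            result.put(tocEntries.get(i), pageNumbers.get(i));
        }
        return result;
    }

    // Renders the merged map back into "1.1 Scope 1"-style output lines.
    public static String format(Map<String, Integer> merged) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : merged.entrySet()) {
            sb.append(e.getKey()).append(' ').append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```

A real tree structure would let you do more (e.g. renumber or filter by depth), but even this flat map is enough to produce the output format in the question.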
What you describe is really a lexer plus parser. The job of the lexical analyser here is to return tokens and ignore whitespace. The tokens here are the various keywords introduced by '\', string literals inside '{' and '}', and arbitrary text elsewhere. Everything else you described is parsing and tree-building.
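A minimal sketch of that token split might look like the following; the type names are my own, and real LaTeX lexing is considerably more involved:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the lexer only turns input into (type, text) pairs and skips
// whitespace; numbering and nesting are left to the parser.
public class TexTokenizer {
    public enum Type { COMMAND, STRING, TEXT }

    public static class Token {
        public final Type type;
        public final String text;
        Token(Type type, String text) { this.type = type; this.text = text; }
    }

    public static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\') {                        // keyword introduced by '\'
                int j = i + 1;
                while (j < input.length() && Character.isLetter(input.charAt(j))) j++;
                tokens.add(new Token(Type.COMMAND, input.substring(i, j)));
                i = j;
            } else if (c == '{') {                  // string literal inside '{' '}'
                int close = input.indexOf('}', i);
                if (close < 0) close = input.length();
                tokens.add(new Token(Type.STRING, input.substring(i + 1, close)));
                i = Math.min(close + 1, input.length());
            } else if (Character.isWhitespace(c)) { // whitespace is ignored
                i++;
            } else {                                // arbitrary text elsewhere
                int j = i;
                while (j < input.length() && input.charAt(j) != '\\'
                        && input.charAt(j) != '{'
                        && !Character.isWhitespace(input.charAt(j))) j++;
                tokens.add(new Token(Type.TEXT, input.substring(i, j)));
                i = j;
            }
        }
        return tokens;
    }
}
```

With this split, a parser consuming the token stream (rather than raw characters) would handle the depth tracking and 1.2.1-style numbering.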