如何在 Javascript 中实现词法分析
嘿伙计们,感谢您的阅读,
我目前正在尝试做一个谷歌风格的计算器。你输入一个字符串,它判断是否可以计算并返回结果。
我慢慢地从基础知识开始:+ - / *
和括号处理。
我愿意随着时间的推移改进计算器,并且不久前学习了一些有关词法分析的知识,我构建了一个标记列表和相关的正则表达式模式。
这种工作很容易适用于 Lex 和 Yacc 等语言,但我正在开发一个仅使用 Javascript 的应用程序。
我试图将这个想法转录成 Javascript,但我不知道如何以干净、美观的方式处理所有内容,尤其是嵌套括号。
分析
让我们定义什么是计算器查询:
// NON TERMINAL EXPRESSIONS //
query -> statement
query -> ε // means end of query
statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number
number -> integer
number -> float
// TERMINAL EXPRESSIONS //
operator -> [+*/%^-]
prefix -> -
integer -> [0-9]+
float -> [0-9]+[.,][0-9]+
Javascript
词法分析在于验证没有任何东西看起来不像终端表达式之一:运算符、前缀、整数和浮点数。可以将其缩短为一个正则表达式:(
我添加了空格以使其更具可读性)
var calcPat =
/^ (\s*
( ([+/*%^-]) | ([0-9]+) | ([0-9]+[.,][0-9]+) | (\() | (\)) )
)+ \s* $/;
如果此测试通过,则查询在词法上是正确的,需要进行语法检查以确定是否可以计算它。 这是棘手的部分
我不会粘贴代码,因为它不干净也不容易理解,但我将解释我遵循的过程以及为什么我陷入困境:
我创建了一个名为的方法isStatement(string)
应该递归地调用自身。主要思想是将字符串拆分为“潜在”语句,并检查它们是否确实是语句并形成一个整体。
过程如下:
-如果前两个标记是数字后跟运算符:
-则,
-- 如果剩下的只是一个标记并且是一个数字:
--- 那么这是一个声明。
--- 否则,检查剩余的标记是否形成语句(递归调用)
-否则,如果第一个标记是括号
-然后,找到匹配的右括号并检查里面是否是语句(递归)
-- 还要检查右括号后面是否有内容,以及与括号结构关联时是否形成语句。
有什么问题吗?
我的问题是,当存在嵌套结构时,我找不到匹配的括号。 我怎样才能做到这一点?此外,正如您所看到的,这不是一个特别通用和干净的语法检查算法。您有什么想法来改进这种模式吗?
非常感谢您花时间阅读所有内容。 Gael
(PS:正如您可能注意到的那样,我的母语不是英语!对于错误以及所有错误,我们深表歉意!)
Hey folks, thanks for reading
I am currently attempting to do a Google-style calculator. You input a string, it determines if it can be calculated and returns the result.
I began slowly with the basics : + - / *
and parenthesis handling.
I am willing to improve the calculator over time, and having learned a bit about lexical analysis a while ago, I built a list of tokens and associated regular expression patterns.
This kind of work is easily applicable with languages such as Lex and Yacc, except I am developping a Javascript-only application.
I tried to transcript the idea into Javascript but I can't figure out how to handle everything in a clean and beautiful way, especially nested parenthesis.
Analysis
Let's define what a calculator query is:
// NON TERMINAL EXPRESSIONS //
query -> statement
query -> ε // means end of query
statement -> statement operator statement
statement -> ( statement )
statement -> prefix statement
statement -> number
number -> integer
number -> float
// TERMINAL EXPRESSIONS //
operator -> [+*/%^-]
prefix -> -
integer -> [0-9]+
float -> [0-9]+[.,][0-9]+
Javascript
Lexical Analysis consists in verifying there is nothing that doesn't look like one of the terminal expressions : operator, prefixes, integer and float. Which can be shortened to one regular expression:
(I added spaces to make it more readable)
var calcPat =
/^ (\s*
( ([+/*%^-]) | ([0-9]+) | ([0-9]+[.,][0-9]+) | (\() | (\)) )
)+ \s* $/;
If this test passes, query is lexically correct and needs to be grammar-checked to determine if it can be calculated. This is the tricky part
I am not going to paste code because it is not clean nor easily understandable, but I am going to explain the process I followed and why I'm stuck:
I created a method called isStatement(string)
that's supposed to call itself recursively. The main idea is to split the string into 'potential' statements and check if they really are statements and form one altogether.
Process is the following:
-If the first two tokens are a number followed by an operator:
-Then,
-- If the remaining is just one token and it is a number:
--- Then this is a statement.
--- Else, check if the remaining tokens form a statement (recursive call)
-Else, If the first token is a parenthesis
-Then, Find matching closing parenthesis and check if what's inside is a statement (recursion)
-- Also check if there is something after closing parenthesis and if it forms a statement when associated with the parenthesis structure.
What's the problem ?
My problem is that I cannot find matching parenthesis when there is nested structures. How can I do that ? Also, as you can see, this is not a particurlarly generic and clean grammar-checking algorithm. Do you have any idea to improve this pattern ?
Thank you so much for having taken the time to read everything.
Gael
(PS: As you probably noticed, I am not a native english speaker ! Sorry for mistakes and all !)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您对词法分析是什么有正确的认识,但您似乎对标记语法和语言语法之间的区别感到困惑。这是两件不同的事情。
标记语法是一组模式(通常是正则表达式),用于描述要解析的语言的标记。正则表达式是字符集上的表达式。
语言语法(我想是目标语法)是您要解析的语言的语法。该语法以标记的形式表示。
您无法编写正则表达式来解析代数符号。您就是做不到。您可以为其编写语法,但它不是常规语法。您想要做的是识别单独的标记,在您的情况下,可以使用类似于您所拥有的正则表达式来完成。诀窍在于,您并没有真正将该表达式应用于要解析的整个句子。相反,您想要匹配句子中当前点的标记。
现在,因为您已经有了可以使用的 Javascript 正则表达式,所以您可以想出一个旨在匹配标记字符串的正则表达式。这样做的技巧是想出一种方法来识别哪个标记与可能性列表中的匹配。 Javascript 正则表达式引擎可以给你返回组数组,所以也许你可以在此基础上构建一些东西。
编辑 - 我正在尝试弄清楚如何从一系列单独的正则表达式(每个标记一个)开始构建一个(某种程度上)通用的标记生成器构建器。它可能不是很复杂,而且会很有趣。
You've got the right idea about what lexical analysis is, but you seem to have gotten confused about the distinction between the token grammar and the language grammar. Those are two different things.
The token grammar is the set of patterns (usually regular expressions) that describe the tokens for the language to be parsed. The regular expressions are expressions over a character set.
The language grammar (or target grammar, I suppose) is the grammar for the language you want to parse. This grammar is expressed in terms of tokens.
You cannot write a regular expression to parse algebraic notation. You just can't. You can write a grammar for it, but it's not a regular grammar. What you want to do is recognize separate tokens, which in your case could be done with a regular expression somewhat like what you've got. The trick is that you're not really applying that expression to the overall sentence to be parsed. Instead, you want to match a token at the current point in the sentence.
Now, because you've got Javascript regular expressions to work with, you could come up with a regular expression designed to match a string of tokens. The trick with that will be coming up with a way to identify which token was matched out of the list of possibilities. The Javascript regex engine can give you back arrays of groups, so maybe you could build something on top of that.
edit — I'm trying to work out how you could put together a (somewhat) general-purpose tokenizer builder, starting from a list of separate regular expressions (one for each token). It's possibly not very complicated, and it'd be pretty fun to have around.