编写迷你语言
我有一个应用程序需要允许用户编写类似于 excel 的表达式:
(H1 + (D1 / C3)) * I8
以及更复杂的东西,例如
If(H1 = 'True', D3 * .2, D3 * .5)
我只能用正则表达式做这么多。 任何有关正确方法的建议以及我可以学习的任何资源将不胜感激。
谢谢!
I have an application that needs to allow users to write expressions similar to excel:
(H1 + (D1 / C3)) * I8
and more complex things like
If(H1 = 'True', D3 * .2, D3 * .5)
I can only do so much with regular expressions. Any suggestions as to the right approach to doing this as well as any resources I can learn from would be much appreciated.
Thanks!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(9)
其他一些问题,您可以在以下位置找到提示:
祝你好运!
Some other question, you will find hints in:
Good luck!
当面临类似的情况时——需要处理短的单行表达式——我编写了一个解析器。 这些表达式是布尔逻辑、形式
等。 在英语中,您可以说存在由 AND 和 OR 连接的原子,每个原子具有三个元素 - 左侧属性、运算符和值。 因为它是如此简洁,我认为解析更容易。 可能的属性集是已知的且有限的(例如:名称、大小、时间)。 运算符因属性而异:不同的属性采用不同的运算符集。 可能值的范围和格式也会根据属性的不同而有所不同。
为了进行解析,我使用 String.Split() 在空格上分割字符串。
后来我意识到,在 Split() 之前,我需要规范化输入字符串 - 在括号前后插入空格。 我用 regex.Replace() 做到了这一点。
分割的输出是一个标记数组。 然后,解析发生在一个大的 for 循环中,并在左侧属性值上切换。 每次循环时,我都会吃掉一组代币。 如果第一个标记是开放括号,则该组的长度只有一个标记:括号本身。 对于众所周知的名称(我的属性值)的标记,解析器必须吸收一组 3 个标记,每个标记代表名称、运算符和值。 如果在任何时候没有足够的标记,解析器就会抛出异常。 根据令牌流,解析器状态将会改变。 连词(AND、OR、XOR)意味着将前一个原子推入堆栈,当下一个原子完成时,我会弹出前一个原子并将这两个原子连接成一个复合原子。 等等。 状态管理发生在解析器每个循环的末尾。
这很简单,只是一点点。 但我们的想法是每个案例陈述都相当简单。 在表达式的原子单元中很容易解析。 棘手的部分是将它们适当地连接在一起。
这个技巧是在每个 slurp 循环结束时的内务管理部分中使用状态堆栈和原子堆栈完成的。 根据解析器状态可能会发生不同的事情。 正如我所说,在每个 case 语句中,解析器状态可能会发生变化,之前的状态会被推入堆栈。 然后,在 switch 语句的末尾,如果状态表明我刚刚完成了对原子的解析,并且有一个未决的连词,我会将刚刚解析的原子移到CompoundAtom 中。 代码如下所示:
另一个神奇之处是 EnumUtil.Parse。 这使我能够解析诸如“<”之类的东西 转换为枚举值。 假设您像这样定义枚举:
通常 Enum.Parse 查找枚举值的符号名称,并且 < 不是有效的符号名称。 EnumUtil.Parse() 查找描述中的内容。 代码如下所示:
我从其他地方得到了 EnumUtil.Parse() 东西。 也许在这里?
When faced with a similar situation - the need to handle short one-line expressions - I wrote a parser. The expressions were boolean logic, of the form
and so on. In english you could say that there are atoms joined by AND and OR, and each atom has three elements - a left-hand-side attribute, an operator, and a value. Because it was so succint I think the parsing was easier. The set of possible attributes is known and limited (eg: name, size, time). The operators vary by attribute: different attributes take different sets of operators. And the range and format of possible values vary according to attribute as well.
To parse, I split the string on whitespace using String.Split().
I later realized that prior to Split(), I needed to normalize the input string - inserting whitespace before and after parens. I did that with a regex.Replace().
The output of the split is an array of tokens. Then parsing occurs in a big for loop with a switch on the left-hand-side attribute value. With each go-round of the loop, I was set to slurp in a group of tokens. If the first token was an open-paren, then the group was just one token in length: the paren itself. For tokens that were well-known names - my attribute values - the parser had to slurp in a group of 3 tokens, one each for the name, the operator, and the value. If at any point there are not enough tokens, the parser throws an exception. Based on the stream of tokens, the parser state would change. A conjunction (AND,OR,XOR) meant to push the prior atom onto a stack, and when the next atom was finished, I'd pop the prior atom and join those two atoms into a compound atom. And so on. The state management happened at the end of each loop of the parser.
That's simplified, just a little. But the idea is that each case statement is fairly simple. It's easy to parse in an atomic unit of the expression. The tricky part was joining them all together appropriately.
That trick was accomplished in the housekeeping section, at the end of each slurp-loop, using the state stack and the atom stack. Different stuff can happen according to the parser state. As I said, in each case statement, the parser state might change, with the prior state getting pushed onto a stack. Then at the end of the switch statement, if the state said I had just finished parsing an atom, and there was a pending conjunction, I'd move the just-parsed atom into the CompoundAtom. The code looks like this:
The one other bit of magic was the EnumUtil.Parse. That allows me to parse things like "<" into an enum value. Suppose you define your enums like this:
Normally Enum.Parse looks for the symbolic name of the enum value, and < is not a valid symbolic name. EnumUtil.Parse() looks for the thing in the description. The code looks like this:
I got that EnumUtil.Parse() thing from somewhere else. Maybe here?
一个小的递归下降解析器非常适合这个。 您可能甚至不必构建解析树 - 您可以在解析时进行评估。
现在,如果您只需调用 ParseTop,它就会为您计算表达式的值。
使用 PARSE_HIGHER 宏的原因是为了更容易在中间优先级添加运算符。
执行“if”语句有点复杂。 每个解析例程都需要一个额外的“启用”参数,因此除非启用,否则它不会进行计算。 然后解析“if”一词,解析测试表达式,然后解析两个结果表达式,禁用非活动表达式。
A little recursive-descent parser is perfect for this. You probably don't even have to build a parse tree - you can do the evaluation as you parse.
Now if you just call ParseTop, it will calculate the value of an expression for you.
The reason for the PARSE_HIGHER macro is to make it easier to add operators at intermediate levels of precedence.
To do the "if" statement is a little more involved. Each parse routine needs an additional "enable" argument, so it does no calculation unless it's enabled. Then you parse the word "if", parse the test expression, and then parse the two result expressions, with the inactive one disabled.
您可以使用 .NET JScript 编译器,或与 IronPython、IronRuby 或 IronScheme 的接口(按字母顺序命名,而不是首选项;p)。
You could use the .NET JScript compiler, or interface with IronPython, IronRuby or IronScheme (named alphabetically, not preference ;p ).
我有一个反例来说明如何不这样做:Will o' Wisp(因为这是我自己的代码,所以我有信心批评它)。
该准则有什么好处?
海龟图形 http://i3.codeplex .com/Project/Download/FileDownload.aspx?ProjectName=wisp&DownloadId=34823
代码有什么不好?
I've got a counter-example of how not to do it: Will o’ the Wisp (since this is my own code I feel confident criticizing it).
What's good about the Code?
Turtle graphics http://i3.codeplex.com/Project/Download/FileDownload.aspx?ProjectName=wisp&DownloadId=34823
What's bad about the code?
查看 ANTLR。 您定义语言语法,使用 GUI 工具对其进行测试并生成各种语言的源代码。 开源。
Check out ANTLR. You define a language syntax, test it using a GUI tool and generate source code in a variety of languages. Open Source.
我会推荐这本书Constructing Little Languages。 它带您了解正确完成此任务所需的许多编译器基础知识。
您提出了这样一个事实:除非您对您的语言有一些严格的限制,否则正则表达式将不起作用。 就像其他人所说的那样, 递归下降解析器 就可以解决问题。
接下来的选择是是否使用 解析器生成器,例如 ANTLR,或者从头开始编写一个。
I would recommend the book Constructing Little Languages. It takes you through many compiler basics needed for completing this task properly.
You've brought up the fact that regular expressions will not work unless you have some stringent limits on your language. Like others have said, a Recursive Descent Parser will do the trick.
The choice next would be whether to use a Parser Generator like ANTLR, or to write one from scratch.
看看这个开源项目:
Excel Financial Functions
Have a look at this open source project:
Excel Financial Functions
我建议查看 CoreCalc/FunCalc 的工作:
http://www.itu.dk/people/sestoft/funcalc/
我'我们在生产中使用了 COCO\R 解析器生成器的语法,并且运行速度非常快。
您需要做的就是:
1.从corecalc获取excel语法
2.在其上运行coco.exe(为类似excel的表达式生成解析器)
3. 将表达式树翻译为逆波兰表示法
4.简单计算
I recommend to look at CoreCalc/FunCalc work:
http://www.itu.dk/people/sestoft/funcalc/
I've used their grammar for COCO\R parser generator in production and it works really fast.
All you need to do is:
1. get excel grammar from corecalc
2. run coco.exe on it (generates parser for excel-like expressions)
3. translate expression tree to reverse polish notation
4. simple calc