Generating n statements from a context-free grammar
So as not to reinvent the wheel, I would like to know what has already been done about generating random statements from a context-free grammar (like those understood by yacc, etc.). These grammars are primarily used for parsing, but maybe someone has done some generation for testing the parsers?
Thanks
3 Answers
Check out this blog post. Basically, it randomizes the RHS chosen at each rule application.
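For illustration, here is a minimal sketch of that idea (the toy grammar and names below are made up for this answer, not taken from the blog post): at each rule application, pick one of the nonterminal's right-hand sides uniformly at random and recurse until only terminals remain.

    import random

    # Made-up toy grammar in the shape of a yacc description:
    # nonterminal -> list of alternative right-hand sides.
    # Anything that is not a key of the dict is treated as a terminal.
    GRAMMAR = {
        "expr": [["expr", "+", "term"], ["term"]],
        "term": [["NUM"], ["NUM", "*", "NUM"], ["(", "expr", ")"]],
    }

    def generate(symbol):
        """Expand `symbol`, choosing a uniformly random RHS at each rule application."""
        if symbol not in GRAMMAR:              # terminal: emit it as-is
            return [symbol]
        rhs = random.choice(GRAMMAR[symbol])   # the random RHS choice
        return [tok for part in rhs for tok in generate(part)]

    print(" ".join(generate("expr")))
    # e.g. NUM * NUM + ( NUM )

Note that a uniform choice puts no bound on the recursion, so on heavily recursive grammars this can blow up; that is the gap the other answers address.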
There's an ancient but still interesting article here that shows why you need a few more constraints for effective generation of random sentences than you do for parsing -- it also suggests a simple way of providing those extra constraints and gives a complete example program (...in Fortran IV... but, hey, it is over 40 years old...!-).
Unfortunately I am not aware of any more recent work on this subject (or implementations in more modern languages -- but transcoding the Fortran dusty deck to whatever language you like best shouldn't be as hard as coming up with these concepts on one's own;-) -- this is just the kind of already-ancient stuff I perused in actual paper-based libraries back when I was in college, and I'm amazed that the ACM's online search facilities allowed me to locate the reference I vaguely remembered, so rapidly (kudos, ACM!-). I have never done any original research on this subject.
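I won't try to reproduce the article's exact scheme here, but as an illustration of the kind of extra constraint involved: one simple option is to weight the productions so that recursive alternatives are picked less often than terminating ones (the toy grammar and weights below are invented for the example).

    import random

    # Invented toy grammar: each alternative carries a weight; the recursive
    # alternatives get low weights so a random expansion tends to terminate.
    GRAMMAR = {
        "expr": [(1, ["expr", "+", "term"]), (3, ["term"])],
        "term": [(1, ["(", "expr", ")"]), (3, ["NUM"])],
    }

    def generate(symbol):
        if symbol not in GRAMMAR:                    # terminal
            return [symbol]
        weights, alternatives = zip(*GRAMMAR[symbol])
        rhs = random.choices(alternatives, weights=weights, k=1)[0]
        return [tok for s in rhs for tok in generate(s)]

    print(" ".join(generate("expr")))

A fixed weighting only makes very long sentences unlikely; it does not put a hard cap on their size, so you may still want an explicit depth or size budget on top of it.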
Generation of a sequence of random tokens like this is sort of straightforward: starting with the goal symbol, pick a random expansion for any unfilled nonterminal. You probably want some kind of branch-and-bound search down any branch of the generated parse tree so you can control depth/size.
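For a concrete (if simplified) picture: instead of full branch-and-bound, the sketch below just gives the expansion a depth budget and, once it is spent, forces the production that bottoms out in terminals fastest. The grammar and names are made-up toys, not anything from DMS.

    import random

    # Made-up toy grammar: nonterminal -> list of alternative right-hand sides.
    GRAMMAR = {
        "expr": [["expr", "+", "term"], ["term"]],
        "term": [["NUM"], ["(", "expr", ")"]],
    }

    def shortest_productions(grammar):
        """For each nonterminal, pick the production whose expansion reaches
        an all-terminal string in the fewest steps (terminals have height 0)."""
        height = {nt: float("inf") for nt in grammar}
        best = {}
        changed = True
        while changed:                       # simple fixed-point iteration
            changed = False
            for nt, productions in grammar.items():
                for prod in productions:
                    h = 1 + max(height.get(s, 0) for s in prod)
                    if h < height[nt]:
                        height[nt], best[nt] = h, prod
                        changed = True
        return best

    BEST = shortest_productions(GRAMMAR)

    def generate(symbol, budget=10):
        """Expand at random while the depth budget lasts, then force the
        shortest production so the expansion is guaranteed to terminate."""
        if symbol not in GRAMMAR:            # terminal
            return [symbol]
        prod = random.choice(GRAMMAR[symbol]) if budget > 0 else BEST[symbol]
        return [tok for s in prod for tok in generate(s, budget - 1)]

    print(" ".join(generate("expr")))

A real branch-and-bound version would also track the size produced so far and backtrack over choices, but the budget-plus-fallback trick is usually enough to keep test inputs to a manageable size.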
But I don't see a lot of value in testing parsers this way, at least not if your parser generator accepts the context-free language description directly. This happens when you use a full context-free parser generator/tool such as GLR (which is what we use in our program transformation system, DMS) or an Earley parser.
You have another problem: if you want to test a parser, you need to feed it what it wants, and surely that isn't tokens. Now you have to generate valid lexemes for the terminal leaves. That's not too hard either, but if you wanted to be a purist about this approach you'd write your grammar in scannerless style.
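For example (the token kinds and generators below are hypothetical, not tied to any particular grammar), you can map each terminal token kind to a small generator of concrete text and let literal keywords and punctuation pass through unchanged:

    import random
    import string

    # Hypothetical token kinds as they might appear in a yacc grammar,
    # each mapped to a generator of a concrete random lexeme.
    LEXEME_GENERATORS = {
        "IDENT":  lambda: "".join(random.choices(string.ascii_lowercase, k=random.randint(1, 8))),
        "NUM":    lambda: str(random.randint(0, 9999)),
        "STRING": lambda: '"' + "".join(random.choices(string.ascii_letters, k=5)) + '"',
    }

    def render(tokens):
        """Turn a generated token sequence into source text that the real
        lexer (and hence the parser under test) should accept."""
        out = []
        for tok in tokens:
            gen = LEXEME_GENERATORS.get(tok)
            out.append(gen() if gen else tok)   # literals pass through as-is
        return " ".join(out)

    print(render(["IDENT", "=", "NUM", "+", "IDENT", ";"]))
    # e.g. qkz = 4821 + fw ;

Joining lexemes with spaces is good enough for many grammars; for layout-sensitive languages you would also have to generate whitespace, which is part of why a scannerless grammar is the purist answer.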
But the last problem is that it is hard to find a context-free grammar for most languages. So now you have to debug your gold grammar too; how will you do that?
Once you get to that stage, you're likely to just give up, debug the parser, and get on with your life.
Generation of random subtrees is useful in error recovery, if you can splice in the random tree to fix the source stream. We do a simplified form of this for DMS's parser, which means we have only one piece of error handling machinery.