有什么工具可以根据语言语法随机生成源代码吗?
C程序源代码可以根据C语法(CFG中描述)进行解析,最终转化为许多AST。我正在考虑是否存在这样的工具:它可以通过首先随机生成许多 AST 来完成相反的操作,其中包括没有具体字符串值的标记,只有标记的类型,根据 CFG,然后生成具体的根据正则表达式中标记的定义进行标记。
我可以想象第一步看起来像迭代的非终结符替换,它是随机的并且可以受到一定数量的迭代次数的限制。第二步是根据正则表达式随机生成字符串。
有什么工具可以做到这一点吗?
A C program source code can be parsed according to the C grammar(described in CFG) and eventually turned into many ASTs. I am considering if such tool exists: it can do the reverse thing by firstly randomly generating many ASTs, which include tokens that don't have the concrete string values, just the types of the tokens, according to the CFG, then generating the concrete tokens according to the tokens' definitions in the regular expression.
I can imagine the first step looks like an iterative non-terminals replacement, which is randomly and can be limited by certain number of iteration times. The second step is just generating randomly strings according to regular expressions.
Is there any tool that can do this?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
“数据生成语言”DGL 可以做到这一点,并具有加权概率的附加功能正在输出的语法中的产生式。
一般来说,递归下降解析器可以直接重写为一组递归过程来生成而不是解析/识别语言。
The "Data Generation Language" DGL does this, with the added ability to weight the probabilities of productions in the grammar being output.
In general, a recursive descent parser can be quite directly rewritten into a set of recursive procedures to generate, instead of parse / recognise, the language.
给定一种语言的上下文无关语法,可以生成与该语法匹配的随机字符串。
例如,nearley 解析器生成器包括“unparser”的实现。 可以从语法生成字符串。
在 Prolog 中使用定子句语法可以完成相同的任务。 此处给出了使用定语从句语法的句子生成器的示例。
Given a context-free grammar of a language, it is possible to generate a random string that matches the grammar.
For example, the nearley parser generator includes an implementation of an "unparser" that can generate strings from a grammar.
The same task can be accomplished using definite clause grammars in Prolog. An example of a sentence generator using definite clause grammars is given here.
如果您有标准化形式的语法模型(所有规则都像这样):
和语言 Prettyprinter(例如 AST 到文本转换工具),您可以非常轻松地构建其中一个。
只需从目标符号作为单位树开始。
对于携带值(例如变量名、数字、字符串……)的终端,您必须生成随机内容。
上述算法的一个复杂之处在于它没有明确终止。您真正想要做的是对树的大小选择一些限制,然后运行算法,直到所有非终结符消失或超出限制。在后一种情况下,回溯,撤消上次替换,然后尝试其他操作。这可以让您对确定大小的 AST 进行有界深度优先搜索。
然后漂亮地打印结果。它的 prettyprinter 部分 很难做到正确。
[你可以自己构建所有这些东西,包括漂亮的打印机,但这是一个相当大的工作量。我构建的工具直接以语言参数化的方式包含所有这些机制;请参阅我的简历]。
即使对于结构良好的 AST,一个令人讨厌的问题是它们可能是无意义的;对于不允许这样做的语言,您可能会生成整数 X 的声明,并为其分配字符串文字值。您也许可以消除一些简单的问题,但语言语义可能非常复杂,以 C++ 为例。确保你最终得到一个语义上有意义的程序是非常困难的;本质上,您必须解析结果文本,并对其执行名称和类型解析/检查。对于 C++,您需要一个完整的 C++ 前端。
If you have a model of the grammar in a normalized form (all rules like this):
and language prettyprinter (e.g., AST to text conversion tool), you can build one of these pretty easily.
Simply start with the goal symbol as a unit tree.
For terminals that carry values (e.g., variable names, numbers, strings, ...) you'll have to generate random content.
A complication with the above algorithm is that it doesn't clearly terminate. What you actually want to do is pick some limit on the size of your tree, and run the algorithm until the all nonterminals are gone or you exceed the limit. In the latter case, backtrack, undo the last replacement, and try something else. This gets you a bounded depth-first search for an AST of your determined size.
Then prettyprint the result. Its the prettyprinter part that is hard to get right.
[You can build all this stuff yourself including the prettyprinter, but it is a fair amount of work. I build tools that include all this machinery directly in a language-parameterized way; see my bio].
A nasty problem even with well formed ASTs is that they may be nonsensical; you might produce a declaration of an integer X, and assign a string literal value to it, for a language that doesn't allow that. You can probably eliminate some simple problems, but language semantics can be incredibly complex, consider C++ as an example. Ensuring that you end up with a semantically meaningful program is extremely hard; in essence, you have to parse the resulting text, and perform name and type resolution/checking on it. For C++, you need a complete C++ front end.
随机生成的问题是,对于许多 CFG,输出字符串的预期长度是无限的(使用与非终结符相对应的生成函数和与语法规则相对应的方程可以轻松计算预期长度) ;你必须以某种方式控制产生式的相对概率以保证收敛;例如,有时,将非终结符号的每个产生规则加权与其 RHS 长度相反就足够了,
关于这个主题的更多内容参见:
Noam Chomsky,Marcel-Paul Sch\"{u}tzenberger,``上下文无关语言的代数理论'',第 118-161 页,P. Braffort 和 D. Hirschberg(编辑),计算机编程和形式化北荷兰系统公司 (1963)
(参见维基百科有关乔姆斯基-舒岑伯格枚举定理的条目)
the problem with random generation is that for many CFGs, the expected length of the output string is infinite (there is an easy computation of the expected length using generating functions corresponding to the non-terminal symbols and equations corresponding to the rules of the grammar); you have to control the relative probabilities of the productions in certain ways to guarantee convergence; for example, sometimes, weighting each production rule for a non-terminal symbol inversely to the length of its RHS suffices
there is lot more on this subject in:
Noam Chomsky, Marcel-Paul Sch\"{u}tzenberger, ``The Algebraic Theory of Context-Free Languages'', pp.\ 118-161 in P. Braffort and D. Hirschberg (eds.), Computer Programming and Formal Systems, North-Holland (1963)
(see Wikipedia entry on Chomsky–Schützenberger enumeration theorem)