有什么工具可以根据语言语法随机生成源代码吗？

发布于 2024-10-07 17:25:03 字数 239 浏览 3 评论 0原文

C程序源代码可以根据C语法（CFG中描述）进行解析，最终转化为许多AST。我正在考虑是否存在这样的工具：它可以通过首先随机生成许多 AST 来完成相反的操作，其中包括没有具体字符串值的标记，只有标记的类型，根据 CFG，然后生成具体的根据正则表达式中标记的定义进行标记。

我可以想象第一步看起来像迭代的非终结符替换，它是随机的并且可以受到一定数量的迭代次数的限制。第二步是根据正则表达式随机生成字符串。

有什么工具可以做到这一点吗？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

四叶草在未来唯美盛开 2024-10-14 17:25:03

“数据生成语言”DGL 可以做到这一点，并具有加权概率的附加功能正在输出的语法中的产生式。

一般来说，递归下降解析器可以直接重写为一组递归过程来生成而不是解析/识别语言。

回复收藏 0 原文

离笑几人歌 2024-10-14 17:25:03

给定一种语言的上下文无关语法，可以生成与该语法匹配的随机字符串。

例如，nearley 解析器生成器包括“unparser”的实现。可以从语法生成字符串。

在 Prolog 中使用定子句语法可以完成相同的任务。此处给出了使用定语从句语法的句子生成器的示例。

回复收藏 0 原文

野心澎湃 2024-10-14 17:25:03

如果您有标准化形式的语法模型（所有规则都像这样）：

 LHS = RHS1 RHS2 ...  RHSn ;

和语言 Prettyprinter（例如 AST 到文本转换工具），您可以非常轻松地构建其中一个。

只需从目标符号作为单位树开始。

  Repeat until no nonterminals are left:
    Pick a nonterminal N in the tree;
       Expand by adding children for the right hand side of any rule
       whose left-hand side matches the nonterminal N

对于携带值（例如变量名、数字、字符串……）的终端，您必须生成随机内容。

上述算法的一个复杂之处在于它没有明确终止。您真正想要做的是对树的大小选择一些限制，然后运行算法，直到所有非终结符消失或超出限制。在后一种情况下，回溯，撤消上次替换，然后尝试其他操作。这可以让您对确定大小的 AST 进行有界深度优先搜索。

然后漂亮地打印结果。它的 prettyprinter 部分很难做到正确。

[你可以自己构建所有这些东西，包括漂亮的打印机，但这是一个相当大的工作量。我构建的工具直接以语言参数化的方式包含所有这些机制；请参阅我的简历]。

即使对于结构良好的 AST，一个令人讨厌的问题是它们可能是无意义的；对于不允许这样做的语言，您可能会生成整数 X 的声明，并为其分配字符串文字值。您也许可以消除一些简单的问题，但语言语义可能非常复杂，以 C++ 为例。确保你最终得到一个语义上有意义的程序是非常困难的；本质上，您必须解析结果文本，并对其执行名称和类型解析/检查。对于 C++，您需要一个完整的 C++ 前端。

If you have a model of the grammar in a normalized form (all rules like this):

 LHS = RHS1 RHS2 ...  RHSn ;

and language prettyprinter (e.g., AST to text conversion tool), you can build one of these pretty easily.

Simply start with the goal symbol as a unit tree.

  Repeat until no nonterminals are left:
    Pick a nonterminal N in the tree;
       Expand by adding children for the right hand side of any rule
       whose left-hand side matches the nonterminal N

For terminals that carry values (e.g., variable names, numbers, strings, ...) you'll have to generate random content.

A complication with the above algorithm is that it doesn't clearly terminate. What you actually want to do is pick some limit on the size of your tree, and run the algorithm until the all nonterminals are gone or you exceed the limit. In the latter case, backtrack, undo the last replacement, and try something else. This gets you a bounded depth-first search for an AST of your determined size.

Then prettyprint the result. Its the prettyprinter part that is hard to get right.

[You can build all this stuff yourself including the prettyprinter, but it is a fair amount of work. I build tools that include all this machinery directly in a language-parameterized way; see my bio].

A nasty problem even with well formed ASTs is that they may be nonsensical; you might produce a declaration of an integer X, and assign a string literal value to it, for a language that doesn't allow that. You can probably eliminate some simple problems, but language semantics can be incredibly complex, consider C++ as an example. Ensuring that you end up with a semantically meaningful program is extremely hard; in essence, you have to parse the resulting text, and perform name and type resolution/checking on it. For C++, you need a complete C++ front end.

回复收藏 0 原文

蘑菇王子 2024-10-14 17:25:03

随机生成的问题是，对于许多 CFG，输出字符串的预期长度是无限的（使用与非终结符相对应的生成函数和与语法规则相对应的方程可以轻松计算预期长度） ;你必须以某种方式控制产生式的相对概率以保证收敛；例如，有时，将非终结符号的每个产生规则加权与其 RHS 长度相反就足够了，

关于这个主题的更多内容参见：
Noam Chomsky，Marcel-Paul Sch\"{u}tzenberger，``上下文无关语言的代数理论''，第 118-161 页，P. Braffort 和 D. Hirschberg（编辑），计算机编程和形式化北荷兰系统公司 (1963)
（参见维基百科有关乔姆斯基-舒岑伯格枚举定理的条目）

回复收藏 0 原文

~没有更多了~