如何从正式语法生成句子?

发布于 2024-07-14 08:18:33 字数 576 浏览 6 评论 0原文

从语法生成句子的常见方法是什么?

我想要一种与解析器相反的算法。 也就是说,给定一个正式的上下文无关语法(例如 LL),我想生成一个符合该语法的任意句子。 我在这里使用句子来表示任何有效的文本正文,因此它实际上可以是一个完整的程序(即使它没有任何意义 - 只要它的语法正确)。

语法示例:

program   : <imports> NEWLINE? <namespace>
imports   : ("import" <identifier> NEWLINE)* 
namespace : "namespace " <identifier> NEWLINE "{" <classes> "}" 
identifier: (A-Za-z_) (A-Za-z0-9_)*
...

生成的程序示例:

import jkhbhhuob
import aaaaa888_

namespace u8nFGubgykb
{ class ui0op_np { ... }
}

What's a common way of generating sentences from a grammar?

I want an algorithm that's sort of the opposite of a parser. That is, given a formal context-free grammar (say LL), I want to generate an arbitrary sentence that conforms to that grammar. I use sentence here to mean any valid body of text, so it can actually be a whole program (even if it doesn't make any sense—as long as it's syntactially correct).

Example grammar:

program   : <imports> NEWLINE? <namespace>
imports   : ("import" <identifier> NEWLINE)* 
namespace : "namespace " <identifier> NEWLINE "{" <classes> "}" 
identifier: (A-Za-z_) (A-Za-z0-9_)*
...

Example generated program:

import jkhbhhuob
import aaaaa888_

namespace u8nFGubgykb
{ class ui0op_np { ... }
}

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(9

楠木可依 2024-07-21 08:18:33

下面是一个使用 NLTK 的 Python 示例:

from nltk import parse_cfg, ChartParser
from random import choice

def produce(grammar, symbol):
    words = []
    productions = grammar.productions(lhs = symbol)
    production = choice(productions)
    for sym in production.rhs():
        if isinstance(sym, str):
            words.append(sym)
        else:
            words.extend(produce(grammar, sym))
    return words

grammar = parse_cfg('''
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
V -> 'shot' | 'killed' | 'wounded'
Det -> 'an' | 'my' 
N -> 'elephant' | 'pajamas' | 'cat' | 'dog'
P -> 'in' | 'outside'
''')

parser = ChartParser(grammar)

gr = parser.grammar()
print ' '.join(produce(gr, gr.start()))

该示例改编自 。 生成的句子在语法上是正确的,但仍然是胡言乱语。

Here is a Python example using the NLTK:

from nltk import parse_cfg, ChartParser
from random import choice

def produce(grammar, symbol):
    words = []
    productions = grammar.productions(lhs = symbol)
    production = choice(productions)
    for sym in production.rhs():
        if isinstance(sym, str):
            words.append(sym)
        else:
            words.extend(produce(grammar, sym))
    return words

grammar = parse_cfg('''
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
V -> 'shot' | 'killed' | 'wounded'
Det -> 'an' | 'my' 
N -> 'elephant' | 'pajamas' | 'cat' | 'dog'
P -> 'in' | 'outside'
''')

parser = ChartParser(grammar)

gr = parser.grammar()
print ' '.join(produce(gr, gr.start()))

The example is adapted from the book. The sentences generated are syntactically correct but still total gibberish.

若水微香 2024-07-21 08:18:33

我不知道有一个“通用”算法可以做到这一点。 随机程序生成用于遗传编程,因此您可以寻找基于语法的 GP 系统并了解它们如何处理程序生成。 我会像伪代码一样执行递归规则生成算法:

void GenerateRule(someRule)
{
  foreach (part in someRule.Parts)
  {
    if (part.IsLiteral) OutputLiteral(part);
    if (part.IsIdentifier) Output(GenerateIdentifier(part)));
    if (part.IsRule) GenerateRule(part.Rule);
  }
}

这假设您已将所有部分读入某些数据结构。 您还需要处理重复(随机生成它们发生的次数)和可选规则(掷硬币看看它们是否存在)。


编辑:哦,如果规则有多个选项,您只需选择其中一个选项,并以相同的方式处理它。 因此,如果某个规则是(文字|变量),您将在两者之间随机选择。

I don't know that there's a "common" algorithm for doing this. Random program generation is used in genetic programming so you could look for a grammar based GP system and see how they handle program generation. I would do a recursive rule generation algorithm like the pseudo-code:

void GenerateRule(someRule)
{
  foreach (part in someRule.Parts)
  {
    if (part.IsLiteral) OutputLiteral(part);
    if (part.IsIdentifier) Output(GenerateIdentifier(part)));
    if (part.IsRule) GenerateRule(part.Rule);
  }
}

This assumes that you've read in all of the parts into some data structure. You'd also need to handle the repetitions(randomly generate the number of times they occur) and optional rules (flip a coin to see if they are there or not).


Edit: Oh, and if the rule has more than one option, you'd just pick one of the options to go with, and process it the same way. So if some rule was (Literal|Variable), you'd randomly pick between the two.

ま昔日黯然 2024-07-21 08:18:33

你的解决方案应该遵循语法的归纳结构。 如何为以下每个内容生成随机话语?

  • 终结符号
  • 非终结符号
  • 右侧序列 右侧
  • 选择
  • 右侧星形闭包

如果您写下用于表示语法的数据结构,这一切都会变得更加清晰。 相互递归生成器函数集的结构将非常密切地反映该数据结构。

处理无限递归有点冒险。 最简单的方法是生成话语流并保持深度截止。 或者,如果您使用像 Haskell 这样的惰性语言,您可以生成所有话语,并根据需要剥离尽可能多的有限话语(这是一个比原始问题更棘手的问题,但非常有趣)。

Your solution should follow the inductive structure of the grammar. How do you generate a random utterance for each of the following?

  • Terminal symbol
  • Nonterminal symbol
  • Sequence of right-hand sides
  • Choice of right-hand sides
  • Star closure of right-hand sides

This will all be much clearer if you write down the data structure you use to represent a grammar. The structure of your set of mutually recursive generator functions will mirror that data structure very closely.

Dealing with infinite recursion is a bit dicey. The easiest way is to generate a stream of utterances and keep a depth cutoff. Or if you're using a lazy language like Haskell you can generate all utterances and peel off as many finite ones as you like (a trickier problem than the original question, but very entertaining).

治碍 2024-07-21 08:18:33

我突然想到:

我会递归地工作(基本上与递归体面的解析器相反),使用一些关于如何处理范围的启发式方法((...):可能随机选择)可选(:参见下面的[]),重复(''泊松分布?)。 文字 ("...") 可以简单地写入输出,而子标记 (`<...>') 会生成递归。

这应该不会太难,除非您想保证某种完整的覆盖范围。 即便如此,仅仅生成一堆数据就会有所帮助...


[*]您需要包含少于 50% 的时间的选项,以防止在处理

 nonterm:  otherstuff <nonterm>?

Good catch by < 等规则时出现无限回归。 a href="https://stackoverflow.com/questions/603687/how-do-i-generate-sentences-from-a-formal-grammar/603768#603768">底座。

同样,对于重复,抛出一个强烈收敛的分布。


如果输入语法以 BNF 形式呈现,则需要首先解析输入语法,如下所示。 最简单的事情是使用映射 (name, string),然后从最高级别的令牌开始(您可能认为这意味着第一个......)。

这给你:

("program", "NEWLINE?")

("导入", ("导入" <标识符> NEWLINE)*)

...

以“program”开头,点击“”进口>” 所以你会重复...回来时,发出“NEWLINE?”,所以扔骰子并写或不写,点击“<命名空间>” 所以重复...返回时你就完成了。


我发现自己怀疑以前曾做过这样的事。 如果您只需要输出,我会搜索网络...也许 http:// /portal.acm.org/itation.cfm?doid=966137.966142,尽管大量的解析器生成器使搜索空间变得混乱......尝试 这篇论文也是如此。

顺便说一句,您当地的大学可能有这些期刊的在线订阅,因此您可以通过图书馆免费获取它们。

Off the top of my head:

I'd work recursively (basically the opposite of a recursive decent parser) using some heuristics about what to do with ranges ((...): probably pick at random) optionals (?: see [], below), repetitions('' Poisson distribution?). Literals ("...") are simple written to the output, and subtokens (`<...>') generate a recursion.

This shouldn't be too hard unless you want to guarantee some sort of complete coverage. Even then, just generating a bunch of data would be a help...


[*] You need to include optionals less than 50% of the time to prevent infinite regress when processing rules like

 nonterm:  otherstuff <nonterm>?

Good catch by plinth.

Likewise with repetitions, throw a distributions that converges strongly.


You'll need to parse the input grammar first if it is presented in a BNF form as here. Simplest thing to do would be use a mapping (name, string), then start with the highest level token (which you might assume means the first one...).

This gives you:

("program", "<imports> NEWLINE? <namespace>")

("imports", ("import" <identifier> NEWLINE)*)

...

The you start with "program", hit "<imports>" so you recur...on coming back, hist "NEWLINE?", so throw the dice and write or not, hit "<namespace>" so recur...on return you're done.


I find my self suspecting that this has been done before. If you just need the output, I'd search the web... Perhaps http://portal.acm.org/citation.cfm?doid=966137.966142, though the huge number of parser generators out there clutter up the search space... Try this paper, too.

BTW-- You local university probably has online subscriptions to these journals, so you can get them for free by hooking up at the library.

清晰传感 2024-07-21 08:18:33

您将遇到的问题是,图的递归性质使得您可以生成无限大小的正确语法。 您可能想要做一些事情,例如在语法中设置节点类型的哈希值,其中包含允许自己访问该节点的次数和限制。 然后深度优先搜索您想要的内容。

The problem you will have is that the recursive nature of the graph is such that you can generate correct grammars of infinite size. You will probably want to do something like set up a hash of node types in your grammar with counts and limits of how many times you're allowing yourself to hit that node. Then depth first search to your heart's content.

墨小墨 2024-07-21 08:18:33

我的第一个建议是广度优先搜索。 只需设置规则图并搜索它们即可。 您将开始从尽可能小的程序开始编写程序,然后慢慢变大。 不过,您可能会发现,对于给定数量的规则,您的语法会生成指数级更多的程序,并且使用 DFS 在程序中可能无法获得超过 30 个左右的标记。

深度优先搜索的问题在于,一旦你有了左递归规则,你的搜索就会陷入无限循环。

另一个大问题是语法正确的程序与语义正确的程序相距甚远。 除了最基本的情况之外,在所有情况下生成后一种类型可能是完全不可行的。

My first suggestion would be a breadth first search. Just set up a graph of rules and search through them. You'll start spitting out programs starting from the smallest possible ones, and slowly getting larger. You'll likely find, though, that your grammar will spit out exponentially more programs for a given number of rules and you'll likely not get past 30 or so tokens in a program using a DFS.

The problem with a depth first search is that the second you have a left-recursive rule, your search will get stuck in an infinite loop.

Another big problem is that syntactically correct programs are a long way from semantically correct programs. Generating the latter type is likely completely unfeasable in all but the most basic cases.

感性不性感 2024-07-21 08:18:33

不是答案,但请检查有关语法生成的维基百科条目:
一些

它描述了 使用的常用算法。

not an answer, but check the wikipedia entry on grammar generation:
http://en.wikipedia.org/wiki/Context-free_grammar_generation_algorithms

it described some common algorithms used.

ζ澈沫 2024-07-21 08:18:33

像往常一样,我建议不要重新发明轮子。 我已经为 ARM 汇编器编写了其中一个,但我对此表示遗憾(软件:实践和经验,2007 年 4 月):

“回想起来,现成的表达式生成器应该已用于生成随机 ARM 汇编指令以进行比较。 相反,Perl 脚本是增量构建的,采用每个 ARM 指令定义并生成实例。 然而,增量内部方法的一个优点是,简单的替换可以检测到简单的错误,并且错误搜寻可以逐步进行。”

恐怕我不记得是什么让我改变了主意,而且我怀疑这是否与您的特定需求相关,但我确实建议更加努力地寻找现有的解决方案。 自己写这样的东西不需要太多的纪律,但它总是比你预期的要花更长的时间。

As usual, I’m going to advise against reinventing the wheel. I’ve written one of these for ARM assembler, but I’m on record as regretting it (Software: Practice and Experience April 2007):

“In retrospect, an off-the-shelf expression generator should have been used to generate random ARM assembly instructions for comparison. Instead a Perl script was built incrementally, taking each ARM instruction definition and generating instances. An advantage, however, of the incremental in-house approach was that simple substitutions detected simple bugs, and bug hunting could proceed incrementally.”

I’m afraid I don’t recall what made me change my mind, and I doubt it would be relevant to your specific needs, but I do suggest looking harder for a preexisting solution. It requires less discipline to write stuff like this yourself, but it always takes longer than you expect.

执手闯天涯 2024-07-21 08:18:33

虽然这个想法很好(我之前已经想过很多次),但现实是,如果没有一些样本数据和/或大量的生成器约束/工作量限制,这是一项相当大的工作。

人们可能会发现手工编写示例更容易。 :)

While the idea is nice (I have thought of it many times before), the reality is that without some sample data and/or tons of generator constraints/effort limits, it is quite a big job.

One might just find writing samples by hand easier. :)

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文