如何从正式语法生成句子？

发布于 2024-07-14 08:18:33 字数 576 浏览 6 评论 0原文

从语法生成句子的常见方法是什么？

我想要一种与解析器相反的算法。也就是说，给定一个正式的上下文无关语法（例如 LL），我想生成一个符合该语法的任意句子。我在这里使用句子来表示任何有效的文本正文，因此它实际上可以是一个完整的程序（即使它没有任何意义 - 只要它的语法正确）。

语法示例：

program   : <imports> NEWLINE? <namespace>
imports   : ("import" <identifier> NEWLINE)* 
namespace : "namespace " <identifier> NEWLINE "{" <classes> "}" 
identifier: (A-Za-z_) (A-Za-z0-9_)*
...

生成的程序示例：

import jkhbhhuob
import aaaaa888_

namespace u8nFGubgykb
{ class ui0op_np { ... }
}

原文

What's a common way of generating sentences from a grammar?

I want an algorithm that's sort of the opposite of a parser. That is, given a formal context-free grammar (say LL), I want to generate an arbitrary sentence that conforms to that grammar. I use sentence here to mean any valid body of text, so it can actually be a whole program (even if it doesn't make any sense—as long as it's syntactially correct).

Example grammar:

program   : <imports> NEWLINE? <namespace>
imports   : ("import" <identifier> NEWLINE)* 
namespace : "namespace " <identifier> NEWLINE "{" <classes> "}" 
identifier: (A-Za-z_) (A-Za-z0-9_)*
...

Example generated program:

import jkhbhhuob
import aaaaa888_

namespace u8nFGubgykb
{ class ui0op_np { ... }
}

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

楠木可依 2024-07-21 08:18:33

下面是一个使用 NLTK 的 Python 示例：

from nltk import parse_cfg, ChartParser
from random import choice

def produce(grammar, symbol):
    words = []
    productions = grammar.productions(lhs = symbol)
    production = choice(productions)
    for sym in production.rhs():
        if isinstance(sym, str):
            words.append(sym)
        else:
            words.extend(produce(grammar, sym))
    return words

grammar = parse_cfg('''
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
V -> 'shot' | 'killed' | 'wounded'
Det -> 'an' | 'my' 
N -> 'elephant' | 'pajamas' | 'cat' | 'dog'
P -> 'in' | 'outside'
''')

parser = ChartParser(grammar)

gr = parser.grammar()
print ' '.join(produce(gr, gr.start()))

该示例改编自书。生成的句子在语法上是正确的，但仍然是胡言乱语。

Here is a Python example using the NLTK:

from nltk import parse_cfg, ChartParser
from random import choice

def produce(grammar, symbol):
    words = []
    productions = grammar.productions(lhs = symbol)
    production = choice(productions)
    for sym in production.rhs():
        if isinstance(sym, str):
            words.append(sym)
        else:
            words.extend(produce(grammar, sym))
    return words

grammar = parse_cfg('''
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
V -> 'shot' | 'killed' | 'wounded'
Det -> 'an' | 'my' 
N -> 'elephant' | 'pajamas' | 'cat' | 'dog'
P -> 'in' | 'outside'
''')

parser = ChartParser(grammar)

gr = parser.grammar()
print ' '.join(produce(gr, gr.start()))

The example is adapted from the book. The sentences generated are syntactically correct but still total gibberish.

回复收藏 0 原文

若水微香 2024-07-21 08:18:33

我不知道有一个“通用”算法可以做到这一点。随机程序生成用于遗传编程，因此您可以寻找基于语法的 GP 系统并了解它们如何处理程序生成。我会像伪代码一样执行递归规则生成算法：

void GenerateRule(someRule)
{
  foreach (part in someRule.Parts)
  {
    if (part.IsLiteral) OutputLiteral(part);
    if (part.IsIdentifier) Output(GenerateIdentifier(part)));
    if (part.IsRule) GenerateRule(part.Rule);
  }
}

这假设您已将所有部分读入某些数据结构。您还需要处理重复（随机生成它们发生的次数）和可选规则（掷硬币看看它们是否存在）。

编辑：哦，如果规则有多个选项，您只需选择其中一个选项，并以相同的方式处理它。因此，如果某个规则是（文字|变量），您将在两者之间随机选择。

I don't know that there's a "common" algorithm for doing this. Random program generation is used in genetic programming so you could look for a grammar based GP system and see how they handle program generation. I would do a recursive rule generation algorithm like the pseudo-code:

void GenerateRule(someRule)
{
  foreach (part in someRule.Parts)
  {
    if (part.IsLiteral) OutputLiteral(part);
    if (part.IsIdentifier) Output(GenerateIdentifier(part)));
    if (part.IsRule) GenerateRule(part.Rule);
  }
}

This assumes that you've read in all of the parts into some data structure. You'd also need to handle the repetitions(randomly generate the number of times they occur) and optional rules (flip a coin to see if they are there or not).

Edit: Oh, and if the rule has more than one option, you'd just pick one of the options to go with, and process it the same way. So if some rule was (Literal|Variable), you'd randomly pick between the two.

回复收藏 0 原文

ま昔日黯然 2024-07-21 08:18:33

你的解决方案应该遵循语法的归纳结构。如何为以下每个内容生成随机话语？

终结符号
非终结符号
右侧序列右侧
选择
右侧星形闭包

如果您写下用于表示语法的数据结构，这一切都会变得更加清晰。相互递归生成器函数集的结构将非常密切地反映该数据结构。

处理无限递归有点冒险。最简单的方法是生成话语流并保持深度截止。或者，如果您使用像 Haskell 这样的惰性语言，您可以生成所有话语，并根据需要剥离尽可能多的有限话语（这是一个比原始问题更棘手的问题，但非常有趣）。

回复收藏 0 原文

治碍 2024-07-21 08:18:33

我突然想到：

我会递归地工作（基本上与递归体面的解析器相反），使用一些关于如何处理范围的启发式方法（(...)：可能随机选择）可选（？：参见下面的[]），重复（''泊松分布？）。文字 ("...") 可以简单地写入输出，而子标记 (`<...>') 会生成递归。

这应该不会太难，除非您想保证某种完整的覆盖范围。即便如此，仅仅生成一堆数据就会有所帮助...

[*]您需要包含少于 50% 的时间的选项，以防止在处理

 nonterm:  otherstuff <nonterm>?

Good catch by < 等规则时出现无限回归。 a href="https://stackoverflow.com/questions/603687/how-do-i-generate-sentences-from-a-formal-grammar/603768#603768">底座。

同样，对于重复，抛出一个强烈收敛的分布。

如果输入语法以 BNF 形式呈现，则需要首先解析输入语法，如下所示。最简单的事情是使用映射 (name, string)，然后从最高级别的令牌开始（您可能认为这意味着第一个......）。

这给你：

("program", "NEWLINE?")
("导入", ("导入" <标识符> NEWLINE)*)
...

以“program”开头，点击“”进口>” 所以你会重复...回来时，发出“NEWLINE？”，所以扔骰子并写或不写，点击“<命名空间>” 所以重复...返回时你就完成了。

我发现自己怀疑以前曾做过这样的事。如果您只需要输出，我会搜索网络...也许 http:// /portal.acm.org/itation.cfm?doid=966137.966142，尽管大量的解析器生成器使搜索空间变得混乱......尝试这篇论文也是如此。

顺便说一句，您当地的大学可能有这些期刊的在线订阅，因此您可以通过图书馆免费获取它们。

Off the top of my head:

I'd work recursively (basically the opposite of a recursive decent parser) using some heuristics about what to do with ranges ((...): probably pick at random) optionals (?: see [], below), repetitions('' Poisson distribution?). Literals ("...") are simple written to the output, and subtokens (`<...>') generate a recursion.

This shouldn't be too hard unless you want to guarantee some sort of complete coverage. Even then, just generating a bunch of data would be a help...

[*] You need to include optionals less than 50% of the time to prevent infinite regress when processing rules like

 nonterm:  otherstuff <nonterm>?

Good catch by plinth.

Likewise with repetitions, throw a distributions that converges strongly.

You'll need to parse the input grammar first if it is presented in a BNF form as here. Simplest thing to do would be use a mapping (name, string), then start with the highest level token (which you might assume means the first one...).

This gives you:

("program", "<imports> NEWLINE? <namespace>")
("imports", ("import" <identifier> NEWLINE)*)
...

The you start with "program", hit "<imports>" so you recur...on coming back, hist "NEWLINE?", so throw the dice and write or not, hit "<namespace>" so recur...on return you're done.

I find my self suspecting that this has been done before. If you just need the output, I'd search the web... Perhaps http://portal.acm.org/citation.cfm?doid=966137.966142, though the huge number of parser generators out there clutter up the search space... Try this paper, too.

BTW-- You local university probably has online subscriptions to these journals, so you can get them for free by hooking up at the library.

回复收藏 0 原文