当前位置：文江博客话题详情

c++ prng

偏置随机数生成器的可能方法是什么？

发布于 2024-11-27 02:11:06 字数 192 浏览 2 评论 0 原文

我构建了一个单词生成器，它选择一个长度，然后随机选择字母表中的字母来组成单词。

该程序可以工作，但 99% 的输出都是垃圾，因为它没有遵守英语语言的结构，我得到的 x 和 z 单词与 e 一样多。

我有哪些选项可以让 RNG 产生偏差，以便它更频繁地使用常用字母。

我正在使用带有时间种子的 stl 中的 rand() 。

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

寂寞笑我太脆弱 2024-12-04 02:11:06

输出仍然是垃圾，因为偏置随机数生成器不足以构造正确的英语单词。但使 rng 产生偏差的一种方法是：

制作大型英语文本（语料库）中字母出现次数的直方图。您会得到类似 500 个“e”、3 个“x”、1 个“q”、450 个“a”、200 个“b”等。
将间隔划分为多个范围，其中每个字母都有一个切片，切片的长度是间隔中出现的次数。 a 得到 [0-450), b [450,650), ..., q [3500,3501)。
生成一个介于 0 和间隔总长度之间的随机数，并检查它落在哪里。 450-650 之间的任何数字都会给你 ab，但只有 3500 会给你一个“q”。

回复收藏 0 原文

挽袖吟 2024-12-04 02:11:06

一种方法是使用字母频率。为每个字母定义一个范围：a = [0, 2]（如果字母“a”有 2% 的机会被使用），b = [2, 5]（3% 的机会），依此类推..然后生成0 到 100 之间的随机数并选择一个字母。

另一种方法是使用非确定性有限自动机，您可以在其中定义某些转换（您可以解析圣经并构建概率）。所以你有很多像这样的转换：例如从“a”到“b”的转换是5%。然后你遍历自动机并生成一些单词。

我刚刚看到正确的术语是马尔可夫链，它可能比 NFA 更好。

回复收藏 0 原文

满天都是小星星 2024-12-04 02:11:06

您可以对某些文本正文进行 n-gram 分析，并将其用作基础为了偏见。您可以通过字母或音节来完成此操作。按音节进行分析可能更复杂。

用字母来做，很容易。您迭代源文本中的每个字符并跟踪您遇到的最后 n-1 个字符。然后，对于下一个字符，将最后 n-1 个字符和这个新字符（n 元语法）添加到频率表中。

这个频率表是什么样的？您可以使用将 n 元语法映射到其频率的映射。但这种方法对于我下面建议的算法来说并不是很好。为此，最好将每个 (n-1)-gram 映射到 n-gram 的最后一个字母与其频率的映射。类似于：std::map>。

进行分析并收集统计数据后，算法将如下所示：

选择一个随机的起始 n 元语法。您之前的分析可能包含加权数据，这些数据通常以字母开头；
从所有以前 n-1 个字母开头的 n 元语法中，随机选择最后一个字母（考虑分析中的权重）；
重复直到到达单词结尾（使用预定义的长度或来自有关单词结尾频率的数据）；

要从一组具有不同权重的值中选取随机值，您可以首先设置累积频率表。然后，您在小于频率总和的范围内选择一个随机数，并查看它落在哪个区间。

例如：

A出现10次；
B出现7次；
C出现9次；

您构建下表：{ A: 10, B: 17, C: 26 }。你选择一个 1 到 26 之间的数字。如果小于 10，则为 A；如果小于，则为 A。如果大于或等于10，但小于17，则为B；如果大于17，则为C。

回复收藏 0 原文

神爱温柔 2024-12-04 02:11:06

您可能希望使用英语的字母频率来获得更真实的输出：http://en.wikipedia。 org/wiki/Letter_Frequency。

但如果您想要可发音的单词，您可能应该从音节生成它们。您可以在线找到更多信息，例如在这里：http://spell.psychology.wustl.edu /SyllStructDistPhon/CVC.html

回复收藏 0 原文

谁许谁一生繁华 2024-12-04 02:11:06

您可以通过阅读源文本得出马尔可夫模型，然后生成“类似”的单词来源。

这也适用于从单词生成句子。嗯，有点作品。

回复收藏 0 原文

没有你我更好 2024-12-04 02:11:06

如果您只想更改单词中的字母频率，而不进行进一步的词汇分析（如 qu 对），请获取英语字母频率列表。

然后创建一个加权随机生成器，它将有更多机会输出 e（七分之一的机会），而不是 x（大约千分之一）。

生成加权随机生成器（rand 生成整数，IIRC）：
1.标准化字母频率，使它们都是整数（对于维基百科频率基本上乘以100000）
2. 制作某种查找表，为每个字母分配一定的范围，如下表所示

letter  | weight  |  start   |    end
a       |   8.17% |      0   |   8167
b       |   1.49% |   8168   |   9659
c       |   2.78% |   9660   |  12441
d       |   4.25% |  12442   |  16694
e       |  12.70% |  16695   |  29396
f       |   2.23% |  29397   |  31624
g       |   2.02% |  31625   |  33639
.....
z       |   0.07% | 99926    |  99999

3. 生成 0 到 99999 之间的随机数，并用它来查找相应的字母。这样，您将获得正确的字母频率。

If you want to change just the letter frequency in the words, without futher lexical analisys (like the qu pair), get a list of english language letter frequencies.

Then create a weighted random generator, that will have more chance to output an e (1 in 7 chance) that a x (around 1 in a 1000).

To generate a weighted random generator (rand generates integers, IIRC):
1. Normalize the letter frequencies, so that they are all integers (for the Wikipedia frequencies basically multiply by 100000)
2. Make some sort of lookup table, where to each letter you assign a certain range, like the table below

letter  | weight  |  start   |    end
a       |   8.17% |      0   |   8167
b       |   1.49% |   8168   |   9659
c       |   2.78% |   9660   |  12441
d       |   4.25% |  12442   |  16694
e       |  12.70% |  16695   |  29396
f       |   2.23% |  29397   |  31624
g       |   2.02% |  31625   |  33639
.....
z       |   0.07% | 99926    |  99999

3. Generate a random number between 0 and 99999, and use that to find the corresponding letter. This way, you will have the correct letter frequencies.

回复收藏 0 原文

离去的眼神 2024-12-04 02:11:06

首先，你需要一张表格，上面有字母和它们的重量之类的东西
就像：

struct WeightedLetter
{
    char letter;
    int  weight;
};

static WeightedLetter const letters[] =
{
    { 'a', 82 },
    { 'b', 15 },
    { 'c', 28 },
    //  ...
};

char getLetter()
{
    int totalWeight = 0;
    for ( WeightedLetter const* iter = begin( letters );
            iter != end( letters );
            ++ iter ) {
        totalWeight += iter->weight;
    }
    int choice = rand() % totalWeight;
                // but you probably want a better generator
    WeightedLetter const* result = begin( letters );
    while ( choice > result->weight ) {
        choice -= result->weight;
        ++ result;
    }
    return result->letter;
}

这只是我的想法，所以它可能包含错误；
至少，第二个循环需要一些验证。但它
应该给你基本的想法。

当然，这仍然不会产生类似英语的单词。这
序列“uq”与“qu”的可能性一样，并且没有什么可以阻止的
没有元音的单词，或只有元音的十个字母的单词。关于英语音系的维基百科页面有一些关于什么组合可以在哪里出现的很好的信息，但是它没有任何关于它们的统计数据。另一方面，如果您试图组成可能的单词，例如 Jabberwocky，那么这可能不是问题：选择随机数量的音节，从 1 到某个最大值，然后是开头、核心和结尾。（不要忘记开头和尾声可以是空的。）

First, you need a table with the letters and their weights, something
like:

struct WeightedLetter
{
    char letter;
    int  weight;
};

static WeightedLetter const letters[] =
{
    { 'a', 82 },
    { 'b', 15 },
    { 'c', 28 },
    //  ...
};

char getLetter()
{
    int totalWeight = 0;
    for ( WeightedLetter const* iter = begin( letters );
            iter != end( letters );
            ++ iter ) {
        totalWeight += iter->weight;
    }
    int choice = rand() % totalWeight;
                // but you probably want a better generator
    WeightedLetter const* result = begin( letters );
    while ( choice > result->weight ) {
        choice -= result->weight;
        ++ result;
    }
    return result->letter;
}

This is just off the top of my head, so it's likely to contain errors;
at the very least, the second loop requires some verification. But it
should give you the basic idea.

Of course, this still isn't going to result in English-like words. The
sequence "uq" is just as likely as "qu", and there's nothing to prevent
a word without a vowel, or a ten letter word with just vowels. The Wikipedia page on English Phonology has some good information as to what combinations can occur where, but it doesn't have any statistics on them. On the other hand, if you're trying to make up possible words, like Jabberwocky, then that may not be a problem: choose a random number of syllables, from 1 to some maximum, then an onset, a nucleus and a coda. (Don't forget that the onset and the coda can be empty.)