偏置随机数生成器的可能方法是什么?

发布于 2024-11-27 02:11:06 字数 192 浏览 2 评论 0 原文

我构建了一个单词生成器,它选择一个长度,然后随机选择字母表中的字母来组成单词。

该程序可以工作,但 99% 的输出都是垃圾,因为它没有遵守英语语言的结构,我得到的 x 和 z 单词与 e 一样多。

我有哪些选项可以让 RNG 产生偏差,以便它更频繁地使用常用字母。

我正在使用带有时间种子的 stl 中的 rand() 。

I built a word generator, it picks a length and then randomly picks letters of the alphabet to make up words.

The program works but 99% of the output is rubbish as it is not observing the constructs of the English language, I am getting as many words with x and z in as I do e.

What are my options for biasing the RNG so it will use common letters more often.

I am using rand() from the stl seeded with the time.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(8

寂寞笑我太脆弱 2024-12-04 02:11:06

输出仍然是垃圾,因为偏置随机数生成器不足以构造正确的英语单词。但使 rng 产生偏差的一种方法是:

  1. 制作大型英语文本(语料库)中字母出现次数的直方图。您会得到类似 500 个“e”、3 个“x”、1 个“q”、450 个“a”、200 个“b”等。
  2. 将间隔划分为多个范围,其中每个字母都有一个切片,切片的长度是间隔中出现的次数。 a 得到 [0-450), b [450,650), ..., q [3500,3501)。
  3. 生成一个介于 0 和间隔总长度之间的随机数,并检查它落在哪里。 450-650 之间的任何数字都会给你 ab,但只有 3500 会给你一个“q”。

The output will still be rubbish because biasing the random number generator is not enough to construct proper English words. But one approach to biasing the rng is:

  1. Make a histogram of the occurences of letters in a large English text (the corpus). You'll get something like 500 'e', 3 'x', 1 'q', 450 'a', 200 'b' and so on.
  2. Divide an interval into ranges where each letter gets a slice, with the length of the slice being the number of occurences in the interval. a gets [0-450), b [450,650), ..., q [3500,3501).
  3. Generate a random number between 0 and the total length of the interval and check where it lands. Any number within 450-650 gives you a b, but only 3500 gives you a 'q'.
挽袖吟 2024-12-04 02:11:06

一种方法是使用字母频率。为每个字母定义一个范围:a = [0, 2](如果字母“a”有 2% 的机会被使用),b = [2, 5](3% 的机会),依此类推..然后生成0 到 100 之间的随机数并选择一个字母。

另一种方法是使用非确定性有限自动机,您可以在其中定义某些转换(您可以解析圣经并构建概率)。所以你有很多像这样的转换:例如从“a”到“b”的转换是5%。然后你遍历自动机并生成一些单词。

我刚刚看到正确的术语是马尔可夫链,它可能比 NFA 更好。

One method would be to use the letter frequency. For each letter define a range: a = [0, 2] (if the letter 'a' has 2% chance of being used), b = [2, 5] (3% chance), and so forth.. then generate a random number between 0 and 100 and choose a letter.

An other method is to use a nondeterministic finite automata where you can define certain transitions (you could parse the bible and build your probability). So you have a lot of transitions like this: e.g. the transition from 'a' to 'b' is 5%. Then you walk through the automata and generate some words.

I just saw that the proper term is markov chain, which is probably better than a NFA.

满天都是小星星 2024-12-04 02:11:06

您可以对某些文本正文进行 n-gram 分析,并将其用作基础为了偏见。您可以通过字母或音节来完成此操作。按音节进行分析可能更复杂。

用字母来做,很容易。您迭代源文本中的每个字符并跟踪您遇到的最后 n-1 个字符。然后,对于下一个字符,将最后 n-1 个字符和这个新字符(n 元语法)添加到频率表中。

这个频率表是什么样的?您可以使用将 n 元语法映射到其频率的映射。但这种方法对于我下面建议的算法来说并不是很好。为此,最好将每个 (n-1)-gram 映射到 n-gram 的最后一个字母与其频率的映射。类似于:std::map>

进行分析并收集统计数据后,算法将如下所示:

  1. 选择一个随机的起始 n 元语法。您之前的分析可能包含加权数据,这些数据通常以字母开头;
  2. 从所有以前 n-1 个字母开头的 n 元语法中,随机选择最后一个字母(考虑分析中的权重);
  3. 重复直到到达单词结尾(使用预定义的长度或来自有关单词结尾频率的数据);

要从一组具有不同权重的值中选取随机值,您可以首先设置累积频率表。然后,您在小于频率总和的范围内选择一个随机数,并查看它落在哪个区间。

例如:

  • A出现10次;
  • B出现7次;
  • C出现9次;

您构建下表:{ A: 10, B: 17, C: 26 }。你选择一个 1 到 26 之间的数字。如果小于 10,则为 A;如果小于,则为 A。如果大于或等于10,但小于17,则为B;如果大于17,则为C。

You can do an n-gram analysis of some body of text and use that as a base for the bias. You can do this either by letters or by syllables. Doing the analysis by syllables is probably more complicated.

To do it by letters, it's easy. You iterate through each character in the source text and keep track of the last n-1 characters you came across. Then, for each next character, you add the last n-1 characters and this new one (a n-gram) to your table of frequencies.

What does this table of frequencies look like? You can use a map mapping the n-grams to their frequencies. But this approach is not very good for the algorithm I suggest below. For that it's better to map each (n-1)-grams to a map of the last letter of an n-gram to its frequency. Something like: std::map<std::string, std::map<char,int>>.

Having made the analysis and collected the statistics, the algorithm would go like this:

  1. pick a random starting n-gram. Your previous analysis may contain weighted data for which letters usually start words;
  2. from all the n-grams that start with previous n-1 letters, pick a random last letter (considering the weights from the analysis);
  3. repeat until you reach the end of a word (either using a predefined length or from data about word ending frequencies);

To pick random values from a set of values with different weights, you can start by setting up a table of the cumulative frequencies. Then you pick a random number between less than the sum of the frequencies, and see in what interval it falls.

For example:

  • A happens 10 times;
  • B happens 7 times;
  • C happens 9 times;

You build the following table: { A: 10, B: 17, C: 26 }. You pick a number between 1 and 26. If it is less than 10, it's A; if it's greater or equal to 10, but less than 17, it's B; if it's greater than 17, it's C.

神爱温柔 2024-12-04 02:11:06

您可能希望使用英语的字母频率来获得更真实的输出:http://en.wikipedia。 org/wiki/Letter_Frequency

但如果您想要可发音的单词,您可能应该从音节生成它们。您可以在线找到更多信息,例如在这里:http://spell.psychology.wustl.edu /SyllStructDistPhon/CVC.html

You may want to use the English language's letter frequency to have a more realistic output : http://en.wikipedia.org/wiki/Letter_frequency.

But if you want pronounceable words, you should probably generate them from syllabes. You can find more information online, e.g. here : http://spell.psychology.wustl.edu/SyllStructDistPhon/CVC.html

谁许谁一生繁华 2024-12-04 02:11:06

您可以通过阅读源文本得出马尔可夫模型,然后生成“类似”的单词来源。

这也适用于从单词生成句子。嗯,有点作品。

You could derive a Markov Model be reading a source text and then generate words which are "like" the source.

This also works for generating sentences from words. Well, sort of works.

没有你我更好 2024-12-04 02:11:06

如果您只想更改单词中的字母频率,而不进行进一步的词汇分析(如 qu 对),请获取英语字母频率列表。

然后创建一个加权随机生成器,它将有更多机会输出 e(七分之一的机会),而不是 x(大约千分之一)。

生成加权随机生成器(rand 生成整数,IIRC):
1.标准化字母频率,使它们都是整数(对于维基百科频率基本上乘以100000)
2. 制作某种查找表,为每个字母分配一定的范围,如下表所示

letter  | weight  |  start   |    end
a       |   8.17% |      0   |   8167
b       |   1.49% |   8168   |   9659
c       |   2.78% |   9660   |  12441
d       |   4.25% |  12442   |  16694
e       |  12.70% |  16695   |  29396
f       |   2.23% |  29397   |  31624
g       |   2.02% |  31625   |  33639
.....
z       |   0.07% | 99926    |  99999

3. 生成 0 到 99999 之间的随机数,并用它来查找相应的字母。这样,您将获得正确的字母频率。

If you want to change just the letter frequency in the words, without futher lexical analisys (like the qu pair), get a list of english language letter frequencies.

Then create a weighted random generator, that will have more chance to output an e (1 in 7 chance) that a x (around 1 in a 1000).

To generate a weighted random generator (rand generates integers, IIRC):
1. Normalize the letter frequencies, so that they are all integers (for the Wikipedia frequencies basically multiply by 100000)
2. Make some sort of lookup table, where to each letter you assign a certain range, like the table below

letter  | weight  |  start   |    end
a       |   8.17% |      0   |   8167
b       |   1.49% |   8168   |   9659
c       |   2.78% |   9660   |  12441
d       |   4.25% |  12442   |  16694
e       |  12.70% |  16695   |  29396
f       |   2.23% |  29397   |  31624
g       |   2.02% |  31625   |  33639
.....
z       |   0.07% | 99926    |  99999

3. Generate a random number between 0 and 99999, and use that to find the corresponding letter. This way, you will have the correct letter frequencies.

离去的眼神 2024-12-04 02:11:06

首先,你需要一张表格,上面有字母和它们的重量之类的东西
就像:

struct WeightedLetter
{
    char letter;
    int  weight;
};

static WeightedLetter const letters[] =
{
    { 'a', 82 },
    { 'b', 15 },
    { 'c', 28 },
    //  ...
};

char getLetter()
{
    int totalWeight = 0;
    for ( WeightedLetter const* iter = begin( letters );
            iter != end( letters );
            ++ iter ) {
        totalWeight += iter->weight;
    }
    int choice = rand() % totalWeight;
                // but you probably want a better generator
    WeightedLetter const* result = begin( letters );
    while ( choice > result->weight ) {
        choice -= result->weight;
        ++ result;
    }
    return result->letter;
}

这只是我的想法,所以它可能包含错误;
至少,第二个循环需要一些验证。但它
应该给你基本的想法。

当然,这仍然不会产生类似英语的单词。这
序列“uq”与“qu”的可能性一样,并且没有什么可以阻止的
没有元音的单词,或只有元音的十个字母的单词。关于英语音系的维基百科页面有一些关于什么组合可以在哪里出现的很好的信息,但是它没有任何关于它们的统计数据。另一方面,如果您试图组成可能的单词,例如 Jabberwocky,那么这可能不是问题:选择随机数量的音节,从 1 到某个最大值,然后是开头、核心和结尾。 (不要忘记开头和尾声可以是空的。)

First, you need a table with the letters and their weights, something
like:

struct WeightedLetter
{
    char letter;
    int  weight;
};

static WeightedLetter const letters[] =
{
    { 'a', 82 },
    { 'b', 15 },
    { 'c', 28 },
    //  ...
};

char getLetter()
{
    int totalWeight = 0;
    for ( WeightedLetter const* iter = begin( letters );
            iter != end( letters );
            ++ iter ) {
        totalWeight += iter->weight;
    }
    int choice = rand() % totalWeight;
                // but you probably want a better generator
    WeightedLetter const* result = begin( letters );
    while ( choice > result->weight ) {
        choice -= result->weight;
        ++ result;
    }
    return result->letter;
}

This is just off the top of my head, so it's likely to contain errors;
at the very least, the second loop requires some verification. But it
should give you the basic idea.

Of course, this still isn't going to result in English-like words. The
sequence "uq" is just as likely as "qu", and there's nothing to prevent
a word without a vowel, or a ten letter word with just vowels. The Wikipedia page on English Phonology has some good information as to what combinations can occur where, but it doesn't have any statistics on them. On the other hand, if you're trying to make up possible words, like Jabberwocky, then that may not be a problem: choose a random number of syllables, from 1 to some maximum, then an onset, a nucleus and a coda. (Don't forget that the onset and the coda can be empty.)

任性一次 2024-12-04 02:11:06

如果您想创建可发音的单词,请不要尝试将字母连接在一起。

加入声音。列出可供选择的声音列表:“abe”、“ape”、“gre”等

If you want to create pronounceable words do not try and join letters together.

Join sounds. Make a list of sounds to select from:"abe", "ape", "gre" etc

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文