偏置随机数生成器的可能方法是什么?
我构建了一个单词生成器,它选择一个长度,然后随机选择字母表中的字母来组成单词。
该程序可以工作,但 99% 的输出都是垃圾,因为它没有遵守英语语言的结构,我得到的 x 和 z 单词与 e 一样多。
我有哪些选项可以让 RNG 产生偏差,以便它更频繁地使用常用字母。
我正在使用带有时间种子的 stl 中的 rand() 。
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(8)
输出仍然是垃圾,因为偏置随机数生成器不足以构造正确的英语单词。但使 rng 产生偏差的一种方法是:
The output will still be rubbish because biasing the random number generator is not enough to construct proper English words. But one approach to biasing the rng is:
一种方法是使用字母频率。为每个字母定义一个范围:a = [0, 2](如果字母“a”有 2% 的机会被使用),b = [2, 5](3% 的机会),依此类推..然后生成0 到 100 之间的随机数并选择一个字母。
另一种方法是使用非确定性有限自动机,您可以在其中定义某些转换(您可以解析圣经并构建概率)。所以你有很多像这样的转换:例如从“a”到“b”的转换是5%。然后你遍历自动机并生成一些单词。
我刚刚看到正确的术语是马尔可夫链,它可能比 NFA 更好。
One method would be to use the letter frequency. For each letter define a range: a = [0, 2] (if the letter 'a' has 2% chance of being used), b = [2, 5] (3% chance), and so forth.. then generate a random number between 0 and 100 and choose a letter.
An other method is to use a nondeterministic finite automata where you can define certain transitions (you could parse the bible and build your probability). So you have a lot of transitions like this: e.g. the transition from 'a' to 'b' is 5%. Then you walk through the automata and generate some words.
I just saw that the proper term is markov chain, which is probably better than a NFA.
您可以对某些文本正文进行 n-gram 分析,并将其用作基础为了偏见。您可以通过字母或音节来完成此操作。按音节进行分析可能更复杂。
用字母来做,很容易。您迭代源文本中的每个字符并跟踪您遇到的最后 n-1 个字符。然后,对于下一个字符,将最后 n-1 个字符和这个新字符(n 元语法)添加到频率表中。
这个频率表是什么样的?您可以使用将 n 元语法映射到其频率的映射。但这种方法对于我下面建议的算法来说并不是很好。为此,最好将每个 (n-1)-gram 映射到 n-gram 的最后一个字母与其频率的映射。类似于:
std::map>
。进行分析并收集统计数据后,算法将如下所示:
要从一组具有不同权重的值中选取随机值,您可以首先设置累积频率表。然后,您在小于频率总和的范围内选择一个随机数,并查看它落在哪个区间。
例如:
您构建下表:{ A: 10, B: 17, C: 26 }。你选择一个 1 到 26 之间的数字。如果小于 10,则为 A;如果小于,则为 A。如果大于或等于10,但小于17,则为B;如果大于17,则为C。
You can do an n-gram analysis of some body of text and use that as a base for the bias. You can do this either by letters or by syllables. Doing the analysis by syllables is probably more complicated.
To do it by letters, it's easy. You iterate through each character in the source text and keep track of the last n-1 characters you came across. Then, for each next character, you add the last n-1 characters and this new one (a n-gram) to your table of frequencies.
What does this table of frequencies look like? You can use a map mapping the n-grams to their frequencies. But this approach is not very good for the algorithm I suggest below. For that it's better to map each (n-1)-grams to a map of the last letter of an n-gram to its frequency. Something like:
std::map<std::string, std::map<char,int>>
.Having made the analysis and collected the statistics, the algorithm would go like this:
To pick random values from a set of values with different weights, you can start by setting up a table of the cumulative frequencies. Then you pick a random number between less than the sum of the frequencies, and see in what interval it falls.
For example:
You build the following table: { A: 10, B: 17, C: 26 }. You pick a number between 1 and 26. If it is less than 10, it's A; if it's greater or equal to 10, but less than 17, it's B; if it's greater than 17, it's C.
您可能希望使用英语的字母频率来获得更真实的输出:http://en.wikipedia。 org/wiki/Letter_Frequency。
但如果您想要可发音的单词,您可能应该从音节生成它们。您可以在线找到更多信息,例如在这里:http://spell.psychology.wustl.edu /SyllStructDistPhon/CVC.html
You may want to use the English language's letter frequency to have a more realistic output : http://en.wikipedia.org/wiki/Letter_frequency.
But if you want pronounceable words, you should probably generate them from syllabes. You can find more information online, e.g. here : http://spell.psychology.wustl.edu/SyllStructDistPhon/CVC.html
您可以通过阅读源文本得出马尔可夫模型,然后生成“类似”的单词来源。
这也适用于从单词生成句子。嗯,有点作品。
You could derive a Markov Model be reading a source text and then generate words which are "like" the source.
This also works for generating sentences from words. Well, sort of works.
如果您只想更改单词中的字母频率,而不进行进一步的词汇分析(如
qu
对),请获取英语字母频率列表。然后创建一个加权随机生成器,它将有更多机会输出
e
(七分之一的机会),而不是x
(大约千分之一)。生成加权随机生成器(rand 生成整数,IIRC):
1.标准化字母频率,使它们都是整数(对于维基百科频率基本上乘以100000)
2. 制作某种查找表,为每个字母分配一定的范围,如下表所示
3. 生成 0 到 99999 之间的随机数,并用它来查找相应的字母。这样,您将获得正确的字母频率。
If you want to change just the letter frequency in the words, without futher lexical analisys (like the
qu
pair), get a list of english language letter frequencies.Then create a weighted random generator, that will have more chance to output an
e
(1 in 7 chance) that ax
(around 1 in a 1000).To generate a weighted random generator (rand generates integers, IIRC):
1. Normalize the letter frequencies, so that they are all integers (for the Wikipedia frequencies basically multiply by 100000)
2. Make some sort of lookup table, where to each letter you assign a certain range, like the table below
3. Generate a random number between 0 and 99999, and use that to find the corresponding letter. This way, you will have the correct letter frequencies.
首先,你需要一张表格,上面有字母和它们的重量之类的东西
就像:
这只是我的想法,所以它可能包含错误;
至少,第二个循环需要一些验证。但它
应该给你基本的想法。
当然,这仍然不会产生类似英语的单词。这
序列“uq”与“qu”的可能性一样,并且没有什么可以阻止的
没有元音的单词,或只有元音的十个字母的单词。关于英语音系的维基百科页面有一些关于什么组合可以在哪里出现的很好的信息,但是它没有任何关于它们的统计数据。另一方面,如果您试图组成可能的单词,例如 Jabberwocky,那么这可能不是问题:选择随机数量的音节,从 1 到某个最大值,然后是开头、核心和结尾。 (不要忘记开头和尾声可以是空的。)
First, you need a table with the letters and their weights, something
like:
This is just off the top of my head, so it's likely to contain errors;
at the very least, the second loop requires some verification. But it
should give you the basic idea.
Of course, this still isn't going to result in English-like words. The
sequence "uq" is just as likely as "qu", and there's nothing to prevent
a word without a vowel, or a ten letter word with just vowels. The Wikipedia page on English Phonology has some good information as to what combinations can occur where, but it doesn't have any statistics on them. On the other hand, if you're trying to make up possible words, like Jabberwocky, then that may not be a problem: choose a random number of syllables, from 1 to some maximum, then an onset, a nucleus and a coda. (Don't forget that the onset and the coda can be empty.)
如果您想创建可发音的单词,请不要尝试将字母连接在一起。
加入声音。列出可供选择的声音列表:“abe”、“ape”、“gre”等
If you want to create pronounceable words do not try and join letters together.
Join sounds. Make a list of sounds to select from:"abe", "ape", "gre" etc