Building a more realistic random word generator?
I've seen many examples of using Markov chains to generate random words from source data, but they often seem a bit overly mechanical and abstract to me. I'm trying to develop a better one.
I believe part of the problem is that they rely entirely on the overall statistical occurrence of letter pairs and ignore the tendency of words to start and end in certain ways. For example, if you use the top 1000 baby names as source data, the letter J is relatively rare overall, yet it's the second most common letter for names to start with. Or, if you're using Latin source data, endings like -um and -us are very common ways for words to end, but they don't stand out if you treat every pair in the word the same.
So, I'm basically trying to put together a Markov chain based word generator that takes into account the way words start and end in the source data.
Conceptually, that makes sense to me, yet I can't figure out how to implement this from a software perspective. I'm trying to put together a little PHP tool that allows you to drop in source data (e.g., a list of 1000 words) from which it will then generate a variety of random words with realistic starts, middles, and endings. (As opposed to most Markov-based word generators, which are just based on the statistical occurrence of pairs overall.)
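For concreteness, the kind of pair-only counting I'm contrasting against looks roughly like this (just an illustrative sketch, not code from any particular generator; `buildPairCounts` is a name I made up):

```php
<?php
// Illustrative sketch of the "pairs overall" approach: count letter-to-letter
// transitions with no notion of where a word starts or ends.
function buildPairCounts(array $words): array {
    $counts = [];
    foreach ($words as $word) {
        $letters = str_split(strtolower(trim($word)));
        for ($i = 0; $i < count($letters) - 1; $i++) {
            $counts[$letters[$i]][$letters[$i + 1]] =
                ($counts[$letters[$i]][$letters[$i + 1]] ?? 0) + 1;
        }
    }
    return $counts; // e.g. $counts['j']['u'] = how often 'u' follows 'j'
}
```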
I'd also like to do this with word length determined by the source data, if possible; i.e., the length breakdown of the randomly generated words should be approximately the same as the length breakdown of the source data.
Any ideas would be massively appreciated! Thanks.
Comments (1)
The part about not respecting common beginnings and endings isn't actually true if you consider "space between words" to be a symbol -- common beginnings will have high frequencies following "space between words" and common endings will have high frequencies preceding "space between words". Correct word length also settles out of that more-or-less naturally -- the mean number of letters you output before transitioning to a "space between words" symbol should equal the mean number of letters per word in the training data, although something in the back of my mind is telling me that the distribution might be off.
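To make that concrete, here's a rough PHP sketch (since you mentioned PHP) of an order-1 chain that treats the word boundary as its own symbol. The `^`/`$` markers and the function names are just placeholders I picked for illustration, not anything standard:

```php
<?php
// Rough sketch: an order-1 Markov word generator where the word boundary is
// a symbol ('^' = before the first letter, '$' = after the last letter), so
// word starts and endings are modelled like any other transition.

function buildModel(array $words): array {
    $model = [];
    foreach ($words as $word) {
        $word = strtolower(trim($word));
        if ($word === '') {
            continue;
        }
        // Wrap the word in boundary markers so the transition into the first
        // letter and out of the last letter get counted like ordinary pairs.
        $symbols = array_merge(['^'], str_split($word), ['$']);
        for ($i = 0; $i < count($symbols) - 1; $i++) {
            $from = $symbols[$i];
            $to   = $symbols[$i + 1];
            $model[$from][$to] = ($model[$from][$to] ?? 0) + 1;
        }
    }
    return $model;
}

// Pick a next symbol with probability proportional to its observed count.
function weightedPick(array $counts): string {
    $r = mt_rand(1, array_sum($counts));
    foreach ($counts as $symbol => $count) {
        $r -= $count;
        if ($r <= 0) {
            return (string) $symbol;
        }
    }
    return (string) array_key_first($counts); // fallback, not normally reached
}

function generateWord(array $model, int $maxLen = 20): string {
    $current = '^';
    $word = '';
    while (strlen($word) < $maxLen) {
        $next = weightedPick($model[$current]);
        if ($next === '$') {
            break; // hit an observed end-of-word transition
        }
        $word .= $next;
        $current = $next;
    }
    return $word;
}

// Tiny made-up corpus just to show the calls; real source data would be
// your list of ~1000 words.
$source = ['julia', 'james', 'justin', 'maximus', 'marcus', 'atrium'];
$model  = buildModel($source);
for ($i = 0; $i < 5; $i++) {
    echo generateWord($model), PHP_EOL;
}
```

Because ending a word is itself a transition learned from the data, common endings like -us and -um fall out of the counts leading into '$', and generated lengths should roughly track the source lengths, though as noted above the exact length distribution of an order-1 chain may not match perfectly.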