How do I determine whether a random string sounds like English?
I have an algorithm that generates strings based on a list of input words. How do I separate only the strings that sound like English words? i.e. discard RDLO while keeping LORD.
EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.
13 Answers
You can build a Markov chain from a huge English text.
Afterwards you can feed words into the Markov chain and check how high the probability is that the word is English.
See here: http://en.wikipedia.org/wiki/Markov_chain
At the bottom of the page you can see the Markov text generator. What you want is exactly the reverse of it.
In a nutshell: the Markov chain stores, for each character, the probabilities of which character will follow next. You can extend this idea to two or three characters if you have enough memory.
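A minimal character-level sketch of this idea in Python (the 1e-6 floor for unseen transitions and the per-transition averaging are illustrative assumptions, not part of the original answer):

```python
import math
from collections import defaultdict

def build_chain(text):
    """Transition counts: for each character, how often each next character follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def log_probability(word, chain):
    """Average log-probability of the word's character transitions under the chain.
    Unseen transitions get a small floor probability instead of zero."""
    logp = 0.0
    for a, b in zip(word, word[1:]):
        total = sum(chain[a].values())
        p = chain[a][b] / total if total else 0.0
        logp += math.log(p) if p > 0 else math.log(1e-6)
    return logp / max(len(word) - 1, 1)
```

Words scoring above a threshold tuned on the training corpus would be accepted as English-sounding.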
The easy way with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian)
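The linked snippet is not reproduced here; as a hedged sketch of the same idea, a candidate can be scored by the log-odds of its letter bigrams under "English" versus "gibberish" training counts (the training lists, the Laplace smoothing, and the 26*26 vocabulary size are assumptions for illustration):

```python
import math
from collections import Counter

def train(words):
    """Count letter bigrams across a list of training words."""
    counts = Counter()
    for w in words:
        w = w.upper()
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts

def score(word, english, gibberish):
    """Naive-Bayes-style log-odds: positive means 'more like English'."""
    word = word.upper()
    total_e = sum(english.values())
    total_g = sum(gibberish.values())
    s = 0.0
    for i in range(len(word) - 1):
        bg = word[i:i + 2]
        p_e = (english[bg] + 1) / (total_e + 26 * 26)    # Laplace smoothing
        p_g = (gibberish[bg] + 1) / (total_g + 26 * 26)
        s += math.log(p_e / p_g)
    return s
```

In practice the "gibberish" class would be trained on random strings over the same alphabet.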
You could approach this by tokenizing a candidate string into bigrams (pairs of adjacent letters) and checking each bigram against a table of English bigram frequencies. The simple version: reject any string containing a bigram that is sufficiently rare in (or absent from) the table. A more sophisticated version: combine the bigram frequencies into an overall likelihood score and reject strings that score too low.
Either of those would require some tuning of the threshold(s), the second technique more so than the first.
Doing the same thing with trigrams would likely be more robust, though it'll also likely lead to a somewhat more strict set of "valid" strings. Whether that's a win or not depends on your application.
Bigram and trigram tables based on existing research corpora may be available for free or purchase (I didn't find any freely available, but have only done a cursory google so far), but you can calculate a bigram or trigram table yourself from any good-sized corpus of English text. Just crank through each word as a token and tally up each bigram—you might handle this as a hash with a given bigram as the key and an incremented integer counter as the value.
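A sketch of that tally-and-check in Python (the `min_count` threshold is an assumed knob you'd tune against your corpus):

```python
from collections import Counter

def bigram_table(corpus_words):
    """Tally bigrams into a hash: bigram -> count, as described above."""
    table = Counter()
    for word in corpus_words:
        word = word.upper()
        table.update(word[i:i + 2] for i in range(len(word) - 1))
    return table

def plausible(candidate, table, min_count=1):
    """Reject any string containing a bigram rarer than min_count in the table."""
    candidate = candidate.upper()
    return all(table[candidate[i:i + 2]] >= min_count
               for i in range(len(candidate) - 1))
```

The trigram variant is identical except for slicing three letters at a time.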
English morphology and English phonetics are (famously!) less than isometric, so this technique might well generate strings that "look" English but present troublesome pronunciations. This is another argument for trigrams rather than bigrams: the weirdness produced by analysis of sounds that use several letters in sequence to produce a given phoneme will be reduced if the n-gram spans the whole sound. (Think "plough" or "tsunami", for example.)
It's quite easy to generate English-sounding words using a Markov chain. Going backwards is more of a challenge, however. What's the acceptable margin of error for the results? You could always have a list of common letter pairs, triples, etc., and grade candidates based on that.
You should research "pronounceable" password generators, since they're trying to accomplish the same task.
A Perl solution would be Crypt::PassGen, which you can train with a dictionary (so you could train it to various languages if you need to). It walks through the dictionary and collects statistics on 1, 2, and 3-letter sequences, then builds new "words" based on relative frequencies.
I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.
Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.
Soundex is very easy to implement - see Wikipedia for a description of the algorithm.
An example implementation of what you want to do would be:
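The original code block appears to have been lost in extraction; here is a minimal Python sketch of the idea. The tiny `read_english_dictionary` stub is a placeholder you'd replace with a real word list (e.g. /usr/share/dict/words), and this Soundex simplifies the H/W adjacency rule:

```python
def soundex(word):
    """Classic American Soundex, e.g. soundex('KEEL') == 'K400'."""
    codes = {}
    for group, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.upper()
    first = word[0]
    result = []
    prev = codes.get(first, "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code  # vowels reset prev (simplified: H/W do too here)
    return (first + "".join(result))[:4].ljust(4, "0")

def read_english_dictionary():
    """Placeholder stub: substitute a real English word list."""
    return ["LORD", "KEEL", "WORD", "REAL"]

# Cache the soundex codes of the dictionary, then test candidates against it.
ENGLISH_CODES = {soundex(w) for w in read_english_dictionary()}

def sounds_like_english(candidate):
    return soundex(candidate) in ENGLISH_CODES
```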
Obviously you'll need to provide an implementation of read_english_dictionary.
EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.
Metaphone and Double Metaphone are similar to SOUNDEX, except they may be tuned more toward your goal than SOUNDEX. They're designed to "hash" words based on their phonetic "sound", and are good at doing this for the English language (but not so much other languages and proper names).
One thing to keep in mind with all three algorithms is that they're extremely sensitive to the first letter of your word. For example, if you're trying to figure out if KEAL is English-sounding, you won't find a match to REAL because the initial letters are different.
Do they have to be real English words, or just strings that look like they could be English words?
If they just need to look like possible English words you could do some statistical analysis on some real English texts and work out which combinations of letters occur frequently. Once you've done that you can throw out strings that are too improbable, although some of them may be real words.
Or you could just use a dictionary and reject words that aren't in it (with some allowances for plurals and other variations).
I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm
I'd suggest a few simple rules and standard pairs and triplets would be good.
For example, English-sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some diphthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be English. You'd find on closer inspection that you will probably strip out a lot of words that do sound like English as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.
You won't remove all false negatives (e.g. I don't think you could manage to come up with a rule to include 'rhythm' without explicitly coding in that 'rhythm' is a word) but it will provide a method of filtering.
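As a rough illustration of such hand-written rules (the cluster whitelist here is a tiny hypothetical starter set you'd extend while training the rules by hand, and treating Y as a vowel is a simplification):

```python
import re

# Hypothetical starter set of allowed two-consonant clusters.
ALLOWED_CLUSTERS = {"TH", "TR", "ST", "CH", "SH", "PL", "GR", "ND", "NT", "RD"}
VOWELS = set("AEIOUY")

def looks_english(word):
    word = word.upper()
    if not any(ch in VOWELS for ch in word):
        return False  # an English-sounding word needs at least one vowel
    # Examine each run of consecutive consonants: runs of 3+ are rejected,
    # runs of 2 must be on the whitelist.
    for run in re.findall(r"[^AEIOUY]+", word):
        if len(run) >= 3:
            return False
        if len(run) == 2 and run not in ALLOWED_CLUSTERS:
            return False
    return True
```

Each false negative you spot becomes a new cluster or exception added to the set.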
I'm also assuming that you want strings that could be english words (they sound reasonable when pronounced) rather than strings that are definitely words with an english meaning.
That sounds like quite an involved task! Off the top of my head, a consonant phoneme needs a vowel either before or after it. Determining what a phoneme is will be quite hard though! You'll probably need to manually write out a list of them. For example, "TR" is ok but not "TD", etc.
I would probably evaluate each word using a SOUNDEX algorithm against a database of English words. If you're doing this on a SQL server it should be pretty easy to set up a database containing a list of most English words (using a freely available dictionary), and MSSQL server has SOUNDEX implemented as an available search algorithm.
Obviously you can implement this yourself if you want, in any language - but it might be quite a task.
This way you'd get an evaluation of how much each word sounds like an existing English word, if any, and you could set up some limits for how low you'd want to accept results. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance limits based on testing.
You could compare them to a dictionary (freely available on the internet), but that may be costly in terms of CPU usage. Other than that, I don't know of any other programmatic way to do it.