How do I determine whether a random string sounds like English?
I have an algorithm that generates strings based on a list of input words. How do I separate only the strings that sound like English words? i.e. discard RDLO while keeping LORD.
EDIT: To clarify, they do not need to be actual words in the dictionary. They just need to sound like English. For example KEAL would be accepted.
13 Answers
You can build a Markov chain from a huge English text.
Afterwards you can feed words into the Markov chain and check how high the probability is that the word is English.
See here: http://en.wikipedia.org/wiki/Markov_chain
At the bottom of the page you can see the Markov text generator. What you want is exactly the reverse of it.
In a nutshell: the Markov chain stores, for each character, the probabilities of which character will follow next. You can extend this idea to two or three characters if you have enough memory.
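A minimal character-level sketch of this idea in Python (the 1e-6 floor for unseen transitions and the per-transition averaging are illustrative assumptions, not part of the original answer):

```python
import math
from collections import defaultdict

def build_chain(text):
    """Transition counts: for each character, how often each next character follows it."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def log_probability(word, chain):
    """Average log-probability of the word's character transitions under the chain.
    Unseen transitions get a small floor probability instead of zero."""
    logp = 0.0
    for a, b in zip(word, word[1:]):
        total = sum(chain[a].values())
        p = chain[a][b] / total if total else 0.0
        logp += math.log(p) if p > 0 else math.log(1e-6)
    return logp / max(len(word) - 1, 1)
```

Words scoring above a threshold tuned on the training corpus would be accepted as English-sounding.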
The easy way with Bayesian filters (Python example from http://sebsauvage.net/python/snyppets/#bayesian)
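The linked snippet is not reproduced here; as a hedged sketch of the same idea, a candidate can be scored by the log-odds of its letter bigrams under "English" versus "gibberish" training counts (the training lists, the Laplace smoothing, and the 26*26 vocabulary size are assumptions for illustration):

```python
import math
from collections import Counter

def train(words):
    """Count letter bigrams across a list of training words."""
    counts = Counter()
    for w in words:
        w = w.upper()
        counts.update(w[i:i + 2] for i in range(len(w) - 1))
    return counts

def score(word, english, gibberish):
    """Naive-Bayes-style log-odds: positive means 'more like English'."""
    word = word.upper()
    total_e = sum(english.values())
    total_g = sum(gibberish.values())
    s = 0.0
    for i in range(len(word) - 1):
        bg = word[i:i + 2]
        p_e = (english[bg] + 1) / (total_e + 26 * 26)    # Laplace smoothing
        p_g = (gibberish[bg] + 1) / (total_g + 26 * 26)
        s += math.log(p_e / p_g)
    return s
```

In practice the "gibberish" class would be trained on random strings over the same alphabet.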
You could approach this by tokenizing a candidate string into bigrams (pairs of adjacent letters) and checking each bigram against a table of English bigram frequencies. The simple version: reject any string containing a bigram that is sufficiently rare in (or absent from) the table. A more sophisticated version: combine the bigram frequencies into an overall likelihood score and reject strings that score too low.
Either of those would require some tuning of the threshold(s), the second technique more so than the first.
Doing the same thing with trigrams would likely be more robust, though it'll also likely lead to a somewhat more strict set of "valid" strings. Whether that's a win or not depends on your application.
Bigram and trigram tables based on existing research corpora may be available for free or purchase (I didn't find any freely available, but have only done a cursory google so far), but you can calculate a bigram or trigram table yourself from any good-sized corpus of English text. Just crank through each word as a token and tally up each bigram—you might handle this as a hash with a given bigram as the key and an incremented integer counter as the value.
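A sketch of that tally-and-check in Python (the `min_count` threshold is an assumed knob you'd tune against your corpus):

```python
from collections import Counter

def bigram_table(corpus_words):
    """Tally bigrams into a hash: bigram -> count, as described above."""
    table = Counter()
    for word in corpus_words:
        word = word.upper()
        table.update(word[i:i + 2] for i in range(len(word) - 1))
    return table

def plausible(candidate, table, min_count=1):
    """Reject any string containing a bigram rarer than min_count in the table."""
    candidate = candidate.upper()
    return all(table[candidate[i:i + 2]] >= min_count
               for i in range(len(candidate) - 1))
```

The trigram variant is identical except for slicing three letters at a time.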
English morphology and English phonetics are (famously!) less than isometric, so this technique might well generate strings that "look" English but present troublesome pronunciations. This is another argument for trigrams rather than bigrams: the weirdness produced by analysis of sounds that use several letters in sequence to produce a given phoneme will be reduced if the n-gram spans the whole sound. (Think "plough" or "tsunami", for example.)
It's quite easy to generate English-sounding words using a Markov chain. Going backwards is more of a challenge, however. What's the acceptable margin of error for the results? You could always have a list of common letter pairs, triples, etc., and grade candidates based on that.
You should research "pronounceable" password generators, since they're trying to accomplish the same task.
A Perl solution would be Crypt::PassGen, which you can train with a dictionary (so you could train it to various languages if you need to). It walks through the dictionary and collects statistics on 1, 2, and 3-letter sequences, then builds new "words" based on relative frequencies.
I'd be tempted to run the soundex algorithm over a dictionary of English words and cache the results, then soundex your candidate string and match against the cache.
Depending on performance requirements, you could work out a distance algorithm for soundex codes and accept strings within a certain tolerance.
Soundex is very easy to implement - see Wikipedia for a description of the algorithm.
An example implementation of what you want to do would be:
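The original code block appears to have been lost in extraction; here is a minimal Python sketch of the idea. The tiny `read_english_dictionary` stub is a placeholder you'd replace with a real word list (e.g. /usr/share/dict/words), and this Soundex simplifies the H/W adjacency rule:

```python
def soundex(word):
    """Classic American Soundex, e.g. soundex('KEEL') == 'K400'."""
    codes = {}
    for group, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.upper()
    first = word[0]
    result = []
    prev = codes.get(first, "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result.append(code)
        prev = code  # vowels reset prev (simplified: H/W do too here)
    return (first + "".join(result))[:4].ljust(4, "0")

def read_english_dictionary():
    """Placeholder stub: substitute a real English word list."""
    return ["LORD", "KEEL", "WORD", "REAL"]

# Cache the soundex codes of the dictionary, then test candidates against it.
ENGLISH_CODES = {soundex(w) for w in read_english_dictionary()}

def sounds_like_english(candidate):
    return soundex(candidate) in ENGLISH_CODES
```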
Obviously you'll need to provide an implementation of read_english_dictionary.
EDIT: Your example of "KEAL" will be fine, since it has the same soundex code (K400) as "KEEL". You may need to log rejected words and manually verify them if you want to get an idea of failure rate.
Metaphone and Double Metaphone are similar to SOUNDEX, except they may be tuned more toward your goal than SOUNDEX. They're designed to "hash" words based on their phonetic "sound", and are good at doing this for the English language (but not so much other languages and proper names).
One thing to keep in mind with all three algorithms is that they're extremely sensitive to the first letter of your word. For example, if you're trying to figure out if KEAL is English-sounding, you won't find a match to REAL because the initial letters are different.
Do they have to be real English words, or just strings that look like they could be English words?
If they just need to look like possible English words you could do some statistical analysis on some real English texts and work out which combinations of letters occur frequently. Once you've done that you can throw out strings that are too improbable, although some of them may be real words.
Or you could just use a dictionary and reject words that aren't in it (with some allowances for plurals and other variations).
I'd suggest looking at the phi test and index of coincidence. http://www.threaded.com/cryptography2.htm
I'd suggest a few simple rules and standard pairs and triplets would be good.
For example, English-sounding words tend to follow the pattern of vowel-consonant-vowel, apart from some diphthongs and standard consonant pairs (e.g. th, ie and ei, oo, tr). With a system like that you should strip out almost all words that don't sound like they could be English. You'd find on closer inspection that you will probably strip out a lot of words that do sound like English as well, but you can then start adding rules that allow for a wider range of words and 'train' your algorithm manually.
You won't remove all false negatives (e.g. I don't think you could manage to come up with a rule to include 'rhythm' without explicitly coding in that 'rhythm' is a word) but it will provide a method of filtering.
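As a rough illustration of such hand-written rules (the cluster whitelist here is a tiny hypothetical starter set you'd extend while training the rules by hand, and treating Y as a vowel is a simplification):

```python
import re

# Hypothetical starter set of allowed two-consonant clusters.
ALLOWED_CLUSTERS = {"TH", "TR", "ST", "CH", "SH", "PL", "GR", "ND", "NT", "RD"}
VOWELS = set("AEIOUY")

def looks_english(word):
    word = word.upper()
    if not any(ch in VOWELS for ch in word):
        return False  # an English-sounding word needs at least one vowel
    # Examine each run of consecutive consonants: runs of 3+ are rejected,
    # runs of 2 must be on the whitelist.
    for run in re.findall(r"[^AEIOUY]+", word):
        if len(run) >= 3:
            return False
        if len(run) == 2 and run not in ALLOWED_CLUSTERS:
            return False
    return True
```

Each false negative you spot becomes a new cluster or exception added to the set.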
I'm also assuming that you want strings that could be english words (they sound reasonable when pronounced) rather than strings that are definitely words with an english meaning.
That sounds like quite an involved task! Off the top of my head, a consonant phoneme needs a vowel either before or after it. Determining what a phoneme is will be quite hard though! You'll probably need to manually write out a list of them. For example, "TR" is ok but not "TD", etc.
I would probably evaluate each word using a SOUNDEX algorithm against a database of English words. If you're doing this on a SQL server it should be pretty easy to set up a database containing a list of most English words (using a freely available dictionary), and MSSQL server has SOUNDEX implemented as an available search algorithm.
Obviously you can implement this yourself if you want, in any language - but it might be quite a task.
This way you'd get an evaluation of how much each word sounds like an existing English word, if any, and you could set up some limits for how low you'd want to accept results. You'd probably want to consider how to combine results for multiple words, and you would probably tweak the acceptance limits based on testing.
You could compare them to a dictionary (freely available on the internet), but that may be costly in terms of CPU usage. Other than that, I don't know of any other programmatic way to do it.