How can I generate a list of similar-sounding words given an input word?
When you misspell a word in Google ("appples" for example), it comes up with the now familiar, "Did you mean: apples" suggestion for you.
Excluding Google's ability to guess your intentions based on relevance of search results, how can I develop a list of words that sound the same?
The words don't have to be English and also do not have to exist. So, for example, if I give the input "hole", I would get back a list including words like: "whole" "hola" "whore" "role" "molar", etc...
I am guessing there might be something online that can develop this list, but I couldn't find anything. If there is not a site and if it can be done using Perl, is there a CPAN module that can help me do this?
Comments (2)
If you are truly looking for words that sound the same, and not just search suggestions, you can look at phonetic algorithms. Soundex and Metaphone/Double Metaphone are two very common ones, and implementations exist for most popular languages.
These algorithms reduce a word to a "key" that indicates its pronunciation. If you start with a corpus of words and build a data structure mapping each key to the words that evaluate to it, you can take an arbitrary string, reduce it to its key, and then look up the other words stored under that key (the structure is probably a hash table of lists or similar).
This isn't perfect, because you'd need to find a big corpus of words to seed your dataset with, but it would work.
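As a rough illustration of the key-to-words mapping described above, here is a minimal Python sketch. It assumes a hand-written version of the common American Soundex rules rather than any particular library, and the names `soundex`, `build_index`, `sounds_like`, and the tiny `CORPUS` are made up for the example. In Perl, the `soundex` function from Text::Soundex could play the same role as the key function.

```python
from collections import defaultdict

# Consonant groups used by American Soundex; vowels, h, w and y get no digit.
_CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        _CODES[ch] = digit


def soundex(word, length=4):
    """Return a fixed-length Soundex key: first letter plus digits, zero padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        # h and w do not separate two consonants with the same code; vowels do.
        if ch not in "hw":
            prev = code
    return key[:length].ljust(length, "0")


def build_index(words):
    """Map each phonetic key to the list of corpus words that reduce to it."""
    index = defaultdict(list)
    for w in words:
        index[soundex(w)].append(w)
    return index


def sounds_like(word, index):
    """Look up corpus words that share the input word's key."""
    return [w for w in index.get(soundex(word), []) if w != word]


# Hypothetical toy corpus; a real one would be a large word list.
CORPUS = ["whole", "hola", "hello", "hill", "role", "rule", "molar", "apple", "apply"]
index = build_index(CORPUS)
print(sounds_like("hole", index))  # ['hola', 'hello', 'hill']
```

Note that plain Soundex keeps the first letter, so "whole" (W400) and "role" (R400) do not land in the same bucket as "hole" (H400). A different key function, such as Metaphone, or a looser comparison of keys would be needed to catch those.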
On the other hand, if you simply want search suggestions/alternate spellings there are easier ways to go about it.
Hope that was helpful.
You can start by learning about the module Text::Soundex. It implements a simple algorithm that maps words to 4-character codes (a letter followed by three digits). I got Soundex out of Sedgewick (ex Knuth) long ago, used it to generate longer keys (not truncated), and suggested lists of corrections for 0- and 1-letter substitutions. I applied this to large databases of census and postal data.
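One possible reading of that approach, sketched in Python: build untruncated Soundex keys, then suggest words whose key is identical (0 substitutions) or differs in exactly one position (1 substitution). This interpretation, and the names `full_soundex` and `suggest`, are assumptions for illustration and not the answerer's actual code.

```python
# Consonant groups used by American Soundex; vowels, h, w and y get no digit.
_CODES = {}
for digit, letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                       "4": "l", "5": "mn", "6": "r"}.items():
    for ch in letters:
        _CODES[ch] = digit


def full_soundex(word):
    """Soundex-style key without truncation or zero padding."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        # h and w do not separate two consonants with the same code; vowels do.
        if ch not in "hw":
            prev = code
    return key


def suggest(word, corpus):
    """Return (exact, near): same key, and keys one substitution away."""
    target = full_soundex(word)
    exact, near = [], []
    for cand in corpus:
        if cand == word:
            continue
        key = full_soundex(cand)
        if key == target:
            exact.append(cand)
        elif len(key) == len(target) and sum(a != b for a, b in zip(key, target)) == 1:
            near.append(cand)
    return exact, near


# Hypothetical toy corpus taken from the words in the question.
exact, near = suggest("hole", ["hola", "whole", "role", "molar", "holiday"])
print(exact)  # ['hola']
print(near)   # ['whole', 'role']
```

Allowing one substitution in the key is what brings "whole" (W4) and "role" (R4) back as candidates for "hole" (H4), while longer words such as "molar" (M46) stay out because their keys have a different length.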