How can I generate a list of similar-sounding words given an input word?

Posted 2025-01-01 12:05:34

When you misspell a word in Google ("appples", for example), it responds with the now-familiar "Did you mean: apples" suggestion.

Setting aside Google's ability to guess your intent from the relevance of search results, how can I generate a list of words that sound alike?

The words don't have to be English, and they don't have to exist. So, for example, if I give the input "hole", I would get back a list including words like "whole", "hola", "whore", "role", "molar", etc.

I am guessing there is something online that can generate such a list, but I couldn't find anything. If no such site exists and it can be done in Perl, is there a CPAN module that can help me do this?

Comments (2)

似狗非友 2025-01-08 12:05:34

If you are truly looking for words that sound the same, and not just search suggestions, you can look at phonetic algorithms. Soundex and Metaphone/Double Metaphone are two very common ones, and implementations of each exist for most popular languages.

These algorithms reduce a word to a "key" that indicates its pronunciation. If you start with a corpus of words and build a data structure mapping each key to the words that reduce to it, you can take an arbitrary string, reduce it to its key, and then look up the other words stored under the same key (the data structure is probably a hash table of lists or similar).
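The key-to-words index described above can be sketched as follows. This is a minimal illustration in Python rather than Perl, assuming the standard American Soundex rules; the function names `soundex`, `build_index` and `sound_alikes` and the tiny corpus are made up for the example and are not from any library.

```python
from collections import defaultdict

# Standard American Soundex letter groups: group position = digit.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for ch in letters:
        CODES[ch] = str(digit)


def soundex(word, length=4):
    """Reduce a word to its Soundex key (first letter plus digits)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        # H and W do not separate two letters with the same code; vowels do.
        if ch not in "HW":
            prev = code
    # Pad with zeros, then truncate to the requested length.
    return (key + "0" * length)[:length]


def build_index(words):
    """Map each Soundex key to the list of corpus words that produce it."""
    index = defaultdict(list)
    for w in words:
        index[soundex(w)].append(w)
    return index


def sound_alikes(word, index):
    """Look up the corpus words that share the input word's key."""
    return [w for w in index.get(soundex(word), []) if w != word]


if __name__ == "__main__":
    corpus = ["hola", "whole", "role", "hall", "molar"]  # hypothetical corpus
    print(sound_alikes("hole", build_index(corpus)))
```

Note that Soundex keeps the first letter of the word, so under this sketch "hole" (H400) matches "hola" and "hall" but not "whole" (W400) or "role" (R400). Whether that is acceptable depends on how loose a match you want.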

This isn't perfect, because you'd need to find a big corpus of words to seed your dataset with, but it would work.

On the other hand, if you simply want search suggestions or alternate spellings, there are easier ways to go about it.

Hope that was helpful.

假面具 2025-01-08 12:05:34

You can start by learning about the module Text::Soundex. It implements Soundex, a simple algorithm that maps words to 4-character codes. I took Soundex from Sedgewick (originally from Knuth) long ago, used it to generate longer (untruncated) keys, and suggested lists of corrections for 0- and 1-letter substitutions. I applied this to large databases of census and postal data.
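One possible reading of the untruncated-key idea is sketched below in Python. It is an assumption about what the answer means, not the author's actual code: keys are left at full length, and a candidate is suggested when its key equals the input's key (0 substitutions) or differs in exactly one position (1 substitution). The names `long_key`, `substitutions` and `suggest` are hypothetical.

```python
# Standard American Soundex letter groups: group position = digit.
CODES = {c: str(d)
         for d, group in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1)
         for c in group}


def long_key(word):
    """Soundex-style key with no truncation to 4 characters and no zero padding."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "HW":  # H and W do not reset the previous code
            prev = code
    return key


def substitutions(a, b):
    """Count differing positions; None when the keys differ in length."""
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def suggest(word, corpus):
    """Split the corpus into exact key matches and one-substitution matches."""
    target = long_key(word)
    exact, near = [], []
    for cand in corpus:
        d = substitutions(target, long_key(cand))
        if d == 0:
            exact.append(cand)
        elif d == 1:
            near.append(cand)
    return exact, near


if __name__ == "__main__":
    print(suggest("hole", ["whole", "hola", "whore", "role", "molar"]))
```

With the question's example words, "hola" shares the key H4, while "whole" (W4) and "role" (R4) are one substitution away. "whore" (W6) and "molar" (M46) fall outside this definition, so a looser distance measure would be needed to include them.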
