How does Google's "Did you mean?" algorithm work?

Published 2024-07-09 09:06:17 · 1,473 characters · 6 views · 0 comments


Comments (18)

滥情空心 2024-07-16 09:06:17

Here's the explanation directly from the source (almost):

Search 101!

at min 22:03

Worth watching!

Basically, according to Douglas Merrill, former CTO of Google, it works like this:

1) You type a (misspelled) word into Google.

2) You don't find what you wanted (you don't click on any results).

3) You realize you misspelled the word, so you rewrite it in the search box.

4) You find what you want (you click on the first links).

This pattern, multiplied millions of times, shows the most common misspellings and the most "common" corrections.

This way Google can, almost instantaneously, offer spelling correction in every language.

This also means that if overnight everyone started to spell night as "nigth", Google would suggest that word instead.

EDIT

@ThomasRutter: Douglas describes it as "statistical machine learning".

They know who corrected the query, because they know which query comes from which user (using cookies).

If users perform a query, and only 10% of them click on a result while 90% go back and type another query (with the corrected word), and this time that 90% click on a result, then they know they have found a correction.
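
The click-pattern signal described here can be sketched as a tiny log-mining routine. This is a hypothetical illustration, not Google's actual pipeline; the session format, the `min_support` threshold, and the function name are all invented for the example:

```python
from collections import defaultdict

def infer_corrections(sessions, min_support=2):
    """Mine (typo -> correction) pairs from per-user query sessions.

    Each session is a time-ordered list of (query, clicked) tuples.
    A rewrite is counted when the first query got no click and the
    rewritten query did -- the "10% vs 90%" pattern described above.
    """
    votes = defaultdict(int)
    for session in sessions:
        for (q1, clicked1), (q2, clicked2) in zip(session, session[1:]):
            if not clicked1 and clicked2 and q1 != q2:
                votes[(q1, q2)] += 1
    # keep only pairs seen often enough to look like a real pattern
    return {pair: n for pair, n in votes.items() if n >= min_support}
```

With enough sessions, the surviving pairs are exactly the "nigth" → "night" style corrections the answer describes.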

They can also tell whether those are two different but "related" queries, because they have information about all the links they show.

Furthermore, they now include context in the spell check, so they can even suggest different words depending on the context.

See this demo of Google Wave (@ 44m 06s), which shows how context is taken into account to automatically correct spelling.

Here is an explanation of how that natural language processing works.

And finally, here is an awesome demo of what can be done by adding automatic machine translation (@ 1h 12m 47s) to the mix.

I've added minute-and-second anchors to the videos to skip directly to the content; if they don't work, try reloading the page or scrolling to the mark by hand.

俯瞰星空 2024-07-16 09:06:17

I found this article some time ago: How to Write a Spelling Corrector, written by Peter Norvig (Director of Research at Google Inc.).

It's an interesting read on the topic of spelling correction. The examples are in Python, but it's clear and simple to understand, and I think the algorithm can easily be translated to other languages.

Below follows a short description of the algorithm.
The algorithm consists of two steps, preparation and word checking.

Step 1: Preparation - setting up the word database

It's best if you can use actual search words and their occurrence counts.
If you don't have that, a large body of text can be used instead.
Count the occurrences (popularity) of each word.

Step 2: Word checking - finding words that are similar to the one being checked

Similar means that the edit distance is low (typically 0-1 or 0-2). The edit distance is the minimum number of inserts/deletes/changes/swaps needed to transform one word into another.

Choose the most popular word from the previous step and suggest it as a correction (if other than the word itself).
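
The two steps above can be sketched in Python, in the spirit of Norvig's corrector; the toy corpus here stands in for a real word-frequency database:

```python
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

# Step 1: preparation -- count word popularity from a body of text.
CORPUS = Counter(words("the quick brown fox spelled the word night right"))

def edits1(word):
    """All strings at edit distance 1 (deletes, transposes, replaces, inserts)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Step 2: prefer the word itself if known, then the most popular
    # known candidate within edit distance 1, else the input unchanged.
    candidates = ({word} & CORPUS.keys()) or (edits1(word) & CORPUS.keys()) or {word}
    return max(candidates, key=lambda w: CORPUS[w])
```

Norvig's full version also considers edit distance 2 and a much larger corpus; the structure is the same.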

燃情 2024-07-16 09:06:17

For the theory of "did you mean" algorithm you can refer to Chapter 3 of Introduction to Information Retrieval. It is available online for free. Section 3.3 (page 52) exactly answers your question. And to specifically answer your update you only need a dictionary of words and nothing else (including millions of users).

不忘初心 2024-07-16 09:06:17

Hmm... I thought that Google used their vast corpus of data (the internet) to do some serious NLP (Natural Language Processing).

For example, they have so much data from the entire internet that they can count the number of times a three-word sequence occurs (known as a trigram). So if they see a sentence like: "pink frugr concert", they could see it has few hits, then find the most likely "pink * concert" in their corpus.

They apparently just do a variation of what Davide Gualano was saying, though, so definitely read that link. Google does of course use all web-pages it knows as a corpus, so that makes its algorithm particularly effective.

九公里浅绿 2024-07-16 09:06:17

My guess is that they use a combination of a Levenshtein distance algorithm and the masses of data they collect regarding the searches that are run. They could pull a set of searches that have the shortest Levenshtein distance from the entered search string, then pick the one with the most results.
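
For reference, the Levenshtein distance mentioned here has a compact dynamic-programming implementation; this is the standard textbook version, not tied to any particular search engine:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b, keeping only one DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```

Note that a transposition like "nigth" for "night" costs 2 here; the Damerau-Levenshtein variant counts it as 1.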

无言温柔 2024-07-16 09:06:17

Normally a production spelling corrector utilizes several methodologies to provide a spelling suggestion. Some are:

  • Decide on a way to determine whether spelling correction is required. These may include insufficient results, results which are not specific or accurate enough (according to some measure), etc. Then:

  • Use a large body of text or a dictionary, where all, or most, words are known to be correctly spelled. These are easily found online, in places such as LingPipe. Then, to determine the best suggestion, you look for the word that is the closest match based on several measures. The most intuitive one is similar characters. What has been shown through research and experimentation is that two- or three-character sequence matches work better (bigrams and trigrams). To further improve results, weight matches at the beginning or end of the word more highly. For performance reasons, index all these words as trigrams or bigrams, so that when you perform a lookup, you convert to n-grams and look up via a hashtable or trie.

  • Use heuristics related to potential keyboard mistakes based on character location. So that "hwllo" should be "hello" because 'w' is close to 'e'.

  • Use a phonetic key (Soundex, Metaphone) to index the words and lookup possible corrections. In practice this normally returns worse results than using n-gram indexing, as described above.

  • In each case you must select the best correction from a list. This may use a distance metric such as Levenshtein, the keyboard metric, etc.

  • For a multi-word phrase, only one word may be misspelled, in which case you can use the remaining words as context in determining a best match.
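
The n-gram indexing idea from the second bullet can be sketched as follows. The padding markers, the Jaccard scoring, and the function names are illustrative choices for this sketch, not a description of any production system:

```python
from collections import defaultdict

def ngrams(word, n=2):
    # pad with markers so matches at word boundaries score too
    padded = f"${word}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def build_index(dictionary, n=2):
    """Map each n-gram to the set of dictionary words containing it."""
    index = defaultdict(set)
    for w in dictionary:
        for g in ngrams(w, n):
            index[g].add(w)
    return index

def suggest(word, index, n=2, topk=3):
    """Rank candidate words by Jaccard overlap of their n-gram sets."""
    grams = ngrams(word, n)
    candidates = set().union(*(index.get(g, set()) for g in grams))

    def score(w):
        other = ngrams(w, n)
        return len(grams & other) / len(grams | other)

    return sorted(candidates, key=score, reverse=True)[:topk]
```

Only words sharing at least one n-gram with the query are ever scored, which is what makes the lookup fast on a large dictionary.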

ぽ尐不点ル 2024-07-16 09:06:17

Use Levenshtein distance, then create a metric tree (or Slim-tree) to index the words.
Then run a 1-nearest-neighbour query, and you get the result.
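
One common concrete form of such a metric tree is a BK-tree, which can serve as a sketch of the near-neighbour lookup this answer describes (the class and method names here are my own):

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

class BKTree:
    """Metric tree keyed by distance; supports near-neighbour queries."""

    def __init__(self, words, distance=levenshtein):
        self.distance = distance
        it = iter(words)
        self.root = [next(it), {}]  # [word, {distance: child}]
        for w in it:
            self._add(w)

    def _add(self, word):
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already present
            if d not in node[1]:
                node[1][d] = [word, {}]
                return
            node = node[1][d]

    def search(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist."""
        results, stack = [], [self.root]
        while stack:
            node_word, children = stack.pop()
            d = self.distance(word, node_word)
            if d <= max_dist:
                results.append((d, node_word))
            # the triangle inequality prunes edges outside [d - max_dist, d + max_dist]
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)
```

The triangle-inequality pruning is what lets the tree skip most of the dictionary on each query.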

满栀 2024-07-16 09:06:17

Google apparently suggests queries with the best results, not the ones that are spelled correctly. But in this case a spell-corrector is probably more feasible. Of course, you could store some value for every query, based on some metric of how good the results it returns are.

So,

  1. You need a dictionary (english or based on your data)

  2. Generate a word trellis and calculate probabilities for the transitions using your dictionary.

  3. Add a decoder to calculate the minimum error distance over the trellis. Of course, you should take care of insertions and deletions when calculating distances. The fun thing is that a QWERTY keyboard maximizes the distance if you hit keys close to each other ("cae" would yield "car", "cay" would yield "cat").

  4. Return the word which has the minimum distance.

  5. Then you could compare that to your query database and check if there is better results for other close matches.
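
The QWERTY intuition in step 3 can be illustrated with a substitution cost weighted by physical key distance. This is a simplified stand-in for a real error model; the cost formula and constants are invented for the sketch, and it handles lowercase letters only:

```python
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {c: (r, i) for r, row in enumerate(QWERTY_ROWS) for i, c in enumerate(row)}

def key_cost(a, b):
    """Substitution cost scaled by key adjacency: neighbours cost 0.5, far keys 1.0."""
    if a == b:
        return 0.0
    (r1, c1), (r2, c2) = POS[a], POS[b]
    return min(1.0, 0.5 * max(abs(r1 - r2), abs(c1 - c2)))

def weighted_distance(a, b):
    """Edit distance where substituting a nearby key is cheaper."""
    prev = [float(i) for i in range(len(b) + 1)]
    for i, ca in enumerate(a, 1):
        curr = [float(i)]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1.0,              # deletion
                            curr[j - 1] + 1.0,          # insertion
                            prev[j - 1] + key_cost(ca, cb)))
        prev = curr
    return prev[-1]
```

Under this metric "cae" is closer to "car" than to "cab", because 'e' sits next to 'r' on the keyboard but far from 'b'.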

层林尽染 2024-07-16 09:06:17

Here is the best answer I found: a spelling corrector implemented and described by Google's Director of Research, Peter Norvig.

If you want to read more about the theory behind this, you can read his book chapter.

The idea of this algorithm is based on statistical machine learning.

带上头具痛哭 2024-07-16 09:06:17

I saw something on this a few years back, so it may have changed since, but apparently they started by analysing their logs for the same users submitting very similar queries within a short space of time, and used machine learning based on how users had corrected themselves.

浅笑轻吟梦一曲 2024-07-16 09:06:17

As a guess... it could:

  1. search for the word
  2. if it is not found, use some algorithm to try to "guess" the word.

It could be something from AI like a Hopfield network or a back-propagation network, or something else that "identifies fingerprints", restores broken data, or does spelling correction, as Davide already mentioned...

贩梦商人 2024-07-16 09:06:17

Apart from the above answers, in case you want to implement something by yourself quickly, here is a suggestion -

Algorithm

You can find the implementation and detailed documentation of this algorithm on GitHub.

  • Create a priority queue with a comparator.
  • Create a Ternary Search Tree (TST) and insert all English words (from Norvig's post) along with their frequencies.
  • Start traversing the TST and, for every word encountered, calculate its Levenshtein distance (LD) from input_word.
  • If LD ≤ 3, put it in the priority queue.
  • Finally, extract 10 words from the priority queue and display them.
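
A simplified version of this recipe, using a flat word-frequency table and Python's `heapq` in place of a full ternary search tree (the TST is only a traversal optimization), might look like:

```python
import heapq

def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def top_suggestions(input_word, word_freq, max_ld=3, k=10):
    """Rank words within edit distance max_ld by (distance, then frequency)."""
    heap = []
    for word, freq in word_freq.items():
        ld = levenshtein(input_word, word)
        if ld <= max_ld:
            # negative frequency so more popular words pop first at equal LD
            heapq.heappush(heap, (ld, -freq, word))
    return [heapq.heappop(heap)[2] for _ in range(min(k, len(heap)))]
```
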

回心转意 2024-07-16 09:06:17

Simple. They have tons of data. They have statistics for every possible term, based on how often it is queried and which variations of it usually yield results the users click... so, when they see that you typed a frequent misspelling of a search term, they go ahead and propose the more usual answer.

Actually, if the misspelling is in effect the most frequently searched term, the algorithm will take it as the right one.

狼性发作 2024-07-16 09:06:17

Regarding your question of how to mimic the behavior without having tons of data - why not use the tons of data collected by Google? Download the Google search results for the misspelled word and search for "Did you mean:" in the HTML.

I guess that's called a mashup nowadays :-)

渔村楼浪 2024-07-16 09:06:17

Do you mean a spell checker? If it is a spell checker rather than a whole phrase, then I've got a link about spell checking where the algorithm is developed in Python. Check this link.

Meanwhile, I am also working on a project that includes searching databases using text. I guess this would solve your problem.

想挽留 2024-07-16 09:06:17

This is an old question, and I'm surprised that nobody suggested Apache Solr to the OP.

Apache Solr is a full-text search engine that, besides many other features, also provides spellchecking and query suggestions. From the documentation:

By default, the Lucene spell checkers sort suggestions first by the score from the string distance calculation and second by the frequency (if available) of the suggestion in the index.

黄昏下泛黄的笔记 2024-07-16 09:06:17

There is a specific data structure - ternary search tree - that naturally supports partial matches and near-neighbor matches.

农村范ル 2024-07-16 09:06:17

The easiest way to figure it out is to Google "dynamic programming".

It's an algorithm that was borrowed from information retrieval and is used heavily in modern-day bioinformatics to see how similar two gene sequences are.

The optimal solution uses dynamic programming and recursion.

This is a well-solved problem with lots of solutions. Just Google around until you find some open-source code.
