How do I correct user input (kind of like Google's "Did you mean?")

I have the following requirement:

I have many (say 1 million) values (names).
The user will type a search string.

I don't expect the user to spell the names correctly.

So, I want to build something like Google's "Did you mean", which will list all the possible values from my datastore. There is a similar but not identical question here; it did not answer my question.

My questions:
1) I think it is not advisable to store this data in an RDBMS, because then I cannot filter in the SQL queries and would have to do a full table scan. So, in this situation, how should the data be stored?

2) The second question is the same as this one. But, just for the completeness of my question: how do I search through the large data set?
Suppose there is a name Franky in the dataset.
If a user types Phranky, how do I match it to Franky? Do I have to loop through all the names?

I came across the Levenshtein distance, which would be a good technique for finding the possible strings. But again, my question is: do I have to run it against all 1 million values in my data store?
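
For reference, the Levenshtein distance can be computed with a short dynamic program. A minimal Python sketch (the Phranky/Franky example is the one from the question above):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions,
        and substitutions needed to turn `a` into `b`."""
        if len(a) < len(b):
            a, b = b, a                        # keep `b` the shorter string
        previous = list(range(len(b) + 1))     # distances from "" to prefixes of b
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,                   # deletion
                    current[j - 1] + 1,                # insertion
                    previous[j - 1] + (ca != cb),      # substitution
                ))
            previous = current
        return previous[-1]

    print(levenshtein("Phranky", "Franky"))    # 2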

3) I know Google does it by watching user behavior. But I want to do it without watching user behavior, i.e. by using something like distance algorithms (I don't know yet exactly which). After all, the former method would require a large volume of searches to start with!

4) As Kirk Broadhurst pointed out in an answer below, there are two possible scenarios:

  • Users mistyping a word (an edit distance algorithm)
  • Users not knowing a word and guessing (a phonetic match algorithm)

I am interested in both of these. They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 2 - too high to be considered a typo.

8 Answers

半暖夏伤 2024-08-08 12:52:14

I would consider using a pre-existing solution for this.

Aspell with a custom dictionary of the names might be well suited for this. Generating the dictionary file will pre-compute all the information required to quickly give suggestions.
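
A rough sketch of that workflow from Python; the dictionary path and name list are made up, and the exact aspell invocations (create master, --master) should be checked against the aspell manual for your version:

    import subprocess

    # Assumed one-time setup from a shell, compiling a master dictionary
    # out of a plain file with one name per line:
    #   aspell --lang=en create master ./names.rws < names.txt

    def suggest(word: str, dictionary: str = "./names.rws") -> list[str]:
        """Ask aspell's pipe mode (-a) for suggestions from a custom dictionary."""
        proc = subprocess.run(
            ["aspell", "-a", "--lang=en", f"--master={dictionary}"],
            input=word, capture_output=True, text=True, check=True,
        )
        # Pipe mode prints a version banner, then one line per input word;
        # a miss looks like: "& phranky 5 0: franky, frankie, ..."
        for line in proc.stdout.splitlines()[1:]:
            if line.startswith("&"):
                return [s.strip() for s in line.split(":", 1)[1].split(",")]
        return []  # a "*" line means the word was found as-is

    print(suggest("phranky"))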

谜兔 2024-08-08 12:52:14

You have two possible issues that you need to address (or not, if you so choose):

  1. Users mistyping a word (an edit distance algorithm)
  2. Users not knowing a word and guessing (a phonetic match algorithm)

Are you interested in both of these, or just one or the other? They are really two separate things; e.g. Sean and Shawn sound the same but have an edit distance of 2 - too high to be considered a typo.

You should pre-index the count of words to ensure you are only suggesting relevant answers (similar to ealdent's suggestion). For example, if I entered sith I might expect to be asked if I meant smith, however if I typed smith it would not make sense to suggest sith. Determine an algorithm which measures the relative likelihood of a word and only suggest words that are more likely.

My experience with loose matching reinforced a simple but important lesson - perform as many indexing/sieve layers as you need, and don't be scared of including more than 2 or 3. Cull anything that doesn't start with the correct letter, for instance, then cull everything that doesn't end with the correct letter, and so on. You really only want to perform the edit distance calculation on the smallest possible dataset, as it is a very intensive operation (see the sketch after this answer).

So if you have an O(n), an O(n log n), and an O(n^2) algorithm - perform all three, in that order, to ensure you are only putting your 'good prospects' through to your heavy algorithm.
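
A minimal sketch of that layered sieve, reusing the `levenshtein` function sketched under the question; the particular filters and the distance threshold are illustrative only:

    def sieve_candidates(query: str, names: list[str], max_dist: int = 2) -> list[str]:
        """Run cheap filters first so the expensive edit distance pass
        only ever sees a small pool of names."""
        q = query.lower()
        # O(n) sieve: first letter must match.
        pool = [n for n in names if n and n[0].lower() == q[0]]
        # O(n) sieve: length difference is a lower bound on edit distance.
        pool = [n for n in pool if abs(len(n) - len(q)) <= max_dist]
        # Expensive step, now on the survivors only.
        return [n for n in pool if levenshtein(q, n.lower()) <= max_dist]

Note that a first-letter sieve would reject phonetic misspellings such as Phranky for Franky - which is exactly why the two scenarios above call for different algorithms.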

夜深人未静 2024-08-08 12:52:14

For people who are recommending Soundex: it is very out of date. Metaphone (simpler) or Double Metaphone (more complex) are much better. If it really is name data, they should work fine, provided the names are European-ish in origin, or at least phonetic.

As for the search, if you care to roll your own rather than use Aspell or some other smart data structure... pre-calculating possible matches is O(n^2) in the naive case, but we know that in order to match at all, candidates must share at least one "phoneme" overlap, maybe even two. This pre-indexing step (which has a low false positive rate) can bring the complexity down a lot (in the practical case to something like O(30^2 * k^2), where k << n).
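
A sketch of that phonetic pre-indexing: bucket every name under its Metaphone key once, then compare a query only against its own bucket. This assumes the third-party jellyfish library for the Metaphone encoding:

    from collections import defaultdict

    import jellyfish  # third-party: pip install jellyfish

    def build_phonetic_index(names: list[str]) -> dict[str, list[str]]:
        """Pre-compute a Metaphone key per name, once, at load time."""
        index = defaultdict(list)
        for name in names:
            index[jellyfish.metaphone(name)].append(name)
        return index

    index = build_phonetic_index(["Franky", "Frankie", "Sean", "Shawn"])
    # At query time only one bucket is scanned, not all n names; Phranky
    # should land in Franky's bucket since Metaphone folds PH into F.
    print(index.get(jellyfish.metaphone("Phranky"), []))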

べ繥欢鉨o。 2024-08-08 12:52:14

Just as in one of the answers to the question you reference, Peter Norvig's great solution would work for this, complete with Python code. Google probably does query suggestion a number of ways, but the thing they have going for them is lots of data. Sure, they can model user behavior with huge query logs, but they can also just use text data to find the most likely correct spelling for a word by looking at which correction is more common. The word someting does not appear in a dictionary, and even though it is a common misspelling, the correct spelling is far more common. When you find similar words, you want the word that is both the closest to the misspelling and the most probable in the given context.

Norvig's solution is to take a corpus of several books from Project Gutenberg and count the words that occur. From those words he creates a dictionary where you can also estimate the probability of a word (COUNT(word) / COUNT(all words)). If you store this all as a straight hash, access is fast, but storage might become a problem, so you can also use things like suffix tries. The access time is still the same (if you implement it based on a hash), but storage requirements can be much less.

Next, he generates simple edits for the misspelt word (by deleting, adding, or substituting a letter) and then constrains the list of possibilities using the dictionary from the corpus. This is based on the idea of edit distance (such as Levenshtein distance), with the simple heuristic that most spelling errors take place with an edit distance of 2 or less. You can widen this as your needs and computational power dictate.
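
The edit-generation step is short enough to show in full; this is essentially the `edits1` function from Norvig's article:

    import string

    def edits1(word: str) -> set[str]:
        """All strings exactly one edit away from `word`."""
        letters = string.ascii_lowercase
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def known(candidates: set[str], dictionary: set[str]) -> set[str]:
        """Constrain candidates to words actually seen in the corpus."""
        return candidates & dictionary

Candidates at edit distance 2 are then just {e2 for e1 in edits1(word) for e2 in edits1(e1)}, filtered the same way.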

Once he has the possible words, he finds the most probable word from the corpus, and that is your suggestion. There are many things you can add to improve the model. For example, you can also adjust the probability by considering the keyboard distance of the letters in the misspelling. Of course, that assumes the user is using an English QWERTY keyboard. For example, mistyping a q for an e is more likely than mistyping an l for an e.

萌梦深 2024-08-08 12:52:14

Just use Solr or a similar search server, and then you won't have to be an expert in the subject. With the list of spelling suggestions, run a search with each suggested result, and if there are more results than the current search query, add that as a "did you mean" result. (This prevents bogus spelling suggestions that don't actually return more relevant hits.) This way, you don't require a lot of data to be collected to make an initial "did you mean" offering, though Solr has mechanisms by which you can hand-tune the results of certain queries.

Generally, you wouldn't be using an RDBMS for this type of searching, instead depending on read-only, slightly stale databases intended for this purpose. (Solr adds a friendly programming interface and configuration to an underlying Lucene engine and database.) On the Web site for the company that I work for, a nightly service selects altered records from the RDBMS and pushes them as documents into Solr. With very little effort, we have a system where the search box can search products, customer reviews, Web site pages, and blog entries very efficiently and offer spelling suggestions in the search results, as well as faceted browsing such as you see at NewEgg, Netflix, or Home Depot, with very little added strain on the server (particularly the RDBMS). (I believe both Zappo's [the new site] and Netflix use Solr internally, but don't quote me on that.)

In your scenario, you'd populate the Solr index with the list of names and select an appropriate matching algorithm in the configuration file.
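
A rough sketch of the query side, assuming a core named "names" with the spellcheck search component enabled in solrconfig.xml (the core name, parameters, and response shape should be verified against your Solr version):

    import requests  # third-party: pip install requests

    SOLR = "http://localhost:8983/solr/names"  # assumed core name

    def did_you_mean(query: str) -> list[str]:
        """Ask Solr's spellcheck component for collated suggestions."""
        resp = requests.get(f"{SOLR}/select", params={
            "q": query,
            "wt": "json",
            "spellcheck": "true",
            "spellcheck.collate": "true",
        })
        resp.raise_for_status()
        spell = resp.json().get("spellcheck", {})
        # Collations come back as a flat alternating list:
        # ["collation", "franky", "collation", "frankie", ...]
        collations = spell.get("collations", [])
        return [c for c in collations if c != "collation"]

    print(did_you_mean("phranky"))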

宣告ˉ结束 2024-08-08 12:52:14

This is an old problem, DWIM (Do What I Mean), famously implemented on the Xerox Alto by Warren Teitelman. If your problem is based on pronunciation, here is a survey paper that might help:

J. Zobel and P. Dart, "Phonetic String Matching: Lessons from Information Retrieval," Proc. 19th Annual Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR '96), Aug. 1996, pp. 166-172.

I'm told by my friends who work in information retrieval that Soundex as described by Knuth is now considered very outdated.

忘年祭陌 2024-08-08 12:52:14

The Bitap algorithm is designed to find an approximate match in a body of text. Maybe you could use that to calculate probable matches. (It's based on the Levenshtein distance.)
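
The fuzzy variant (due to Wu and Manber) keeps one bitmask of partial-match states per allowed error count. A from-scratch sketch for illustration, not library code:

    def bitap_fuzzy(text: str, pattern: str, k: int):
        """Yield end positions in `text` where `pattern` occurs with at most
        k errors (Levenshtein). Bit j of R[d] means pattern[:j+1] matches a
        suffix of the text read so far with <= d errors."""
        m = len(pattern)
        if m == 0:
            return
        mask = {}
        for j, ch in enumerate(pattern):
            mask[ch] = mask.get(ch, 0) | (1 << j)
        R = [0] * (k + 1)
        accept = 1 << (m - 1)
        for pos, ch in enumerate(text):
            cmask = mask.get(ch, 0)
            prev = R[0]                                # R[d-1] before this step
            R[0] = ((R[0] << 1) | 1) & cmask
            for d in range(1, k + 1):
                cur = R[d]
                R[d] = ((((cur << 1) | 1) & cmask)     # character match
                        | ((prev << 1) | 1)            # substitution
                        | prev                         # insertion in the text
                        | ((R[d - 1] << 1) | 1))       # deletion from the text
                prev = cur
            if R[k] & accept:
                yield pos

    # Occurrences of "franky" within 2 errors end at the 'k' and 'y' of "phranky".
    print(list(bitap_fuzzy("phranky", "franky", 2)))   # [5, 6]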

(Update: after reading Ben S's answer, I agree that using an existing solution, possibly aspell, is the way to go.)


As others have said, Google does autocorrection by watching users correct themselves. If I search for "someting" (sic) and then immediately for "something", it is very likely that the first query was incorrect. A possible heuristic to detect this would be:

  • If a user has done two searches in a short time window, and
  • the first query did not yield any results (or the user did not click on anything), and
  • the second query did yield useful results, and
  • the two queries are similar (have a small Levenshtein distance),

then the second query is a possible refinement of the first query which you can store and present to other users.

Note that you probably need a lot of queries to gather enough data for these suggestions to be useful.

梦里寻她 2024-08-08 12:52:14

The Soundex algorithm may help you out with this.

http://en.wikipedia.org/wiki/Soundex

You could pre-generate the Soundex value for each name and store it in the database, then index that column to avoid having to scan the table.
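
A sketch of that pre-computation, assuming the third-party jellyfish library for the Soundex encoding and an illustrative SQLite table:

    import sqlite3

    import jellyfish  # third-party: pip install jellyfish

    conn = sqlite3.connect("names.db")
    conn.execute("CREATE TABLE IF NOT EXISTS names (name TEXT, soundex TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_soundex ON names (soundex)")

    # One-time pre-computation: store each name alongside its Soundex key.
    names = ["Franky", "Frankie", "Sean", "Shawn"]
    conn.executemany(
        "INSERT INTO names VALUES (?, ?)",
        [(n, jellyfish.soundex(n)) for n in names],
    )
    conn.commit()

    # Query time: an index lookup on the key instead of a full table scan.
    # "Shaun" should share key S500 with "Sean" and "Shawn".
    key = jellyfish.soundex("Shaun")
    rows = conn.execute("SELECT name FROM names WHERE soundex = ?", (key,))
    print([r[0] for r in rows])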
