Suggestions on how to improve the current fuzzy search implementation

Posted 2024-09-28 04:26:07


I'm currently working on implementing a fuzzy search for a terminology web service and I'm looking for suggestions on how I might improve the current implementation. It's too much code to share, but I think an explanation might suffice to prompt thoughtful suggestions. I realize it's a lot to read but I'd appreciate any help.

First, the terminology is basically just a number of names (or terms). For each term, we split it into tokens by space and then iterate through each character to add it to the trie. On a terminal node (such as when the character y in strawberry is reached) we store in a list an index into the master term list. So a terminal node can have multiple indices (since the terminal node for strawberry will match 'strawberry' and 'allergy to strawberry').
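To make that concrete, here is a stripped-down sketch of the kind of trie build described (class and field names such as TrieNode, termIndices and masterTermList are illustrative, not our actual code):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the trie described above; names are assumptions.
class TrieNode {
    final Map<Character, TrieNode> children = new HashMap<>();
    // Indices into the master term list; non-empty only on terminal nodes.
    final List<Integer> termIndices = new ArrayList<>();
}

class TermTrie {
    final TrieNode root = new TrieNode();
    final List<String> masterTermList = new ArrayList<>();

    void addTerm(String term) {
        int termIndex = masterTermList.size();
        masterTermList.add(term);
        // Split the term into word tokens and index each one separately, so
        // "allergy to strawberry" is reachable via "allergy", "to" and "strawberry".
        for (String token : term.toLowerCase().split("\\s+")) {
            TrieNode node = root;
            for (char c : token.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new TrieNode());
            }
            node.termIndices.add(termIndex); // terminal node for this token
        }
    }
}
```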

As for the actual search, the search query is also broken up into tokens by space. The search algorithm is run for each token. The first character of the search token must be a match (so traw will never match strawberry). After that, we go through the children of each successive node. If there is a child with a character that matches, we continue the search with the next character of the search token. If a child does not match the given character, we look at its children using the current character of the search token (so not advancing it). This is the fuzziness part, so 'stwb' will match 'strawberry'.
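A rough sketch of that per-token traversal, building on the hypothetical TrieNode above, might look like this; it just collects the nodes reached once the search token is exhausted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch only: collects every trie node reachable when the search token runs out,
// following the rule described above (advance on a character match, otherwise
// descend into the child without advancing the search token).
class FuzzyDescent {

    static List<TrieNode> descend(TrieNode root, String token) {
        List<TrieNode> reached = new ArrayList<>();
        // The first character must match exactly.
        TrieNode start = root.children.get(token.charAt(0));
        if (start != null) {
            walk(start, token, 1, reached);
        }
        return reached;
    }

    private static void walk(TrieNode node, String token, int pos, List<TrieNode> reached) {
        if (pos == token.length()) {
            reached.add(node);
            return;
        }
        char next = token.charAt(pos);
        for (Map.Entry<Character, TrieNode> e : node.children.entrySet()) {
            if (e.getKey() == next) {
                walk(e.getValue(), token, pos + 1, reached); // match: advance the token
            } else {
                walk(e.getValue(), token, pos, reached);     // fuzzy skip: keep the same token char
            }
        }
    }
}
```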

When we reach the end of the search token, we search through the rest of the trie structure at that node to get all potential matches (since the indices into the master term list are only on the terminal nodes). We call this the roll up. We store the indices by setting their bits on a BitSet. Then, we simply AND together the BitSets from each search token's results. We then take, say, the first 1000 or 5000 indices from the ANDed BitSets and find the actual terms they correspond to. We use Levenshtein to score each term and then sort by score to get our final results.
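And a sketch of the roll up plus the BitSet intersection, again using the hypothetical classes from the previous sketches (the Levenshtein scoring step is only indicated in a comment):

```java
import java.util.BitSet;

// Sketch of the roll up and per-token AND described above; assumes the
// TrieNode/TermTrie/FuzzyDescent sketches from earlier.
class RollUpSearch {

    static BitSet search(TermTrie trie, String query) {
        BitSet result = null;
        for (String token : query.toLowerCase().split("\\s+")) {
            BitSet tokenHits = new BitSet();
            for (TrieNode node : FuzzyDescent.descend(trie.root, token)) {
                rollUp(node, tokenHits);
            }
            if (result == null) result = tokenHits;   // first token
            else result.and(tokenHits);               // AND with later tokens
        }
        // The caller would then take the first N set bits, look the terms up in
        // trie.masterTermList, score them with Levenshtein and sort by score.
        return result == null ? new BitSet() : result;
    }

    // Collect every master-list index at or below this node.
    private static void rollUp(TrieNode node, BitSet out) {
        for (int idx : node.termIndices) out.set(idx);
        for (TrieNode child : node.children.values()) rollUp(child, out);
    }
}
```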

This works fairly well and is pretty fast. There are over 390k nodes in the tree and over 1.1 million actual term names. However, there are problems with this as it stands.

For example, searching for 'car cat' will return Catheterization, when we don't want it to (since the search query is two words, the result should contain at least two). That would be easy enough to check, but it doesn't take care of a situation like Catheterization Procedure, since it is two words. Ideally, we'd want it to match something like Cardiac Catheterization.

Based on the need to correct this, we came up with some changes. For one, we go through the trie in a mixed depth/breadth search. Essentially we go depth first as long as a character matches. Those child nodes that didn't match get added to a priority queue. The priority queue is ordered by edit distance, which can be calculated while searching the trie (since if there's a character match, distance remains the same and if not, it increases by 1). By doing this, we get the edit distance for each word.
We are no longer using the BitSet. Instead, it's a map from the index to a TermInfo object. This object stores the indices of the query phrase and the term phrase along with the score. So if the search is "car cat" and a matched term is "Catheterization procedure", the term phrase indices will be 1, as will the query phrase indices. For "Cardiac Catheterization" the term phrase indices will be 1,2, as will the query phrase indices. As you can see, it's very simple afterward to look at the count of term phrase indices and query phrase indices, and if they aren't at least equal to the search word count, they can be discarded.
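A hypothetical shape for that map and the filtering step could look like this (field names are guesses based on the description above; the real TermInfo presumably carries more state, such as the per-word edit distances used later):

```java
import java.util.BitSet;
import java.util.Map;

// Hypothetical TermInfo shape, inferred from the description; not the actual class.
class TermInfo {
    final BitSet termPhraseIndices = new BitSet();   // which words of the term were matched
    final BitSet queryPhraseIndices = new BitSet();  // which words of the query matched something
    int editDistance;                                // accumulated per-word edit distance
}

class PhraseFilter {
    // Discard candidates whose matched word counts are below the search word count.
    static void filter(Map<Integer, TermInfo> candidates, int queryWordCount) {
        candidates.values().removeIf(info ->
                info.termPhraseIndices.cardinality() < queryWordCount
             || info.queryPhraseIndices.cardinality() < queryWordCount);
    }
}
```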

After that, we add up the edit distances of the words, remove the words from the term that match the term phrase index, and count the remaining letters to get the true edit distance. For example, if you matched the term "allergy to strawberries" and your search query was "straw" you would have a score of 7 from strawberries, then you'd use the term phrase index to discard strawberries from the term, and just count "allergy to" (minus the spaces) to get the score of 16.
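As a tiny worked example of that arithmetic (purely illustrative):

```java
// Illustrative only: reproduces the "allergy to strawberries" / "straw" example.
class TrueDistanceExample {
    public static void main(String[] args) {
        String term = "allergy to strawberries";
        int matchedWordDistance = 7;   // Levenshtein("straw", "strawberries")
        // Drop the matched word (term phrase index 2) and count the remaining letters.
        String[] words = term.split(" ");
        int remainingLetters = 0;
        for (int i = 0; i < words.length; i++) {
            if (i != 2) remainingLetters += words[i].length();  // "allergy" + "to" = 9
        }
        int trueDistance = matchedWordDistance + remainingLetters;
        System.out.println(trueDistance);  // 16
    }
}
```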

This gets us the accurate results we expect. However, it is far too slow. Where before we could get 25-40 ms on a one-word search, now it could be as much as half a second. It's largely from things like instantiating TermInfo objects, using .add() operations, .put() operations, and the fact that we have to return a large number of matches. We could limit each search to only return 1000 matches, but there's no guarantee that the first 1000 results for "car" would match any of the first 1000 matches for "cat" (remember, there are over 1.1 million terms).

Even for a single query word, like cat, we still need a large number of matches. This is because if we search for 'cat' the search is going to match car and roll up all the terminal nodes below it (which will be a lot). However, if we limited the number of results, it would place too heavy an emphasis on words that begin with the query and not the edit distance. Thus, words like catheterization would be more likely to be included than something like coat.

So, basically, are there any thoughts on how we could handle the problems that the second implementation fixed, but without as much of the slowdown it introduced? I can include some selected code if it might make things clearer, but I didn't want to post a giant wall of code.


Comments (2)

鹿! 2024-10-05 04:26:07


Wow... tough one.

Well, why don't you implement Lucene? It is the best and current state of the art when it comes to problems like yours, afaik.

However, I want to share some thoughts...

Fuzziness isn't something like straw*; it's rather the mistyping of some words. And every missing/wrong character adds 1 to the distance.

It's generally very, very hard to have partial matching (wildcards) and fuzziness at the same time!

Tokenizing is generally a good idea.

Everything also heavily depends on the data you get. Are there spelling mistakes in the source files or only in the search queries?

I have seen some pretty nice implementations using multi-dimensional range trees.

But I really think if you want to accomplish all of the above you need a pretty neat combination of a graph set and a nice indexing algorithm.

You could, for example, use a semantic database like Sesame and, when importing your documents, import every token and document as a node. Then, depending on position in the document etc., you can add a weighted relation.

Then you need the tokens in some structure where you can do efficient fuzzy matches, such as BK-trees.
I think you could index the tokens in a MySQL database and do bit-wise comparison functions to get differences. There's a function that returns all matching bits; if you transliterate your strings to ASCII and group the bits you could achieve something pretty fast.
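For the BK-tree idea, a minimal sketch over Levenshtein distance could look something like this (purely illustrative, not tied to any particular library):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Minimal BK-tree sketch using Levenshtein distance as the metric.
class BkTree {
    private final String word;
    private final Map<Integer, BkTree> children = new HashMap<>();

    BkTree(String word) { this.word = word; }

    void add(String w) {
        int d = levenshtein(w, word);
        if (d == 0) return;                       // already present
        BkTree child = children.get(d);
        if (child == null) children.put(d, new BkTree(w));
        else child.add(w);
    }

    // Collect all words within maxDist of the query. The triangle inequality lets
    // us prune: only children whose edge distance lies in [d - maxDist, d + maxDist]
    // can contain matches.
    void query(String q, int maxDist, List<String> out) {
        int d = levenshtein(q, word);
        if (d <= maxDist) out.add(word);
        for (int i = Math.max(1, d - maxDist); i <= d + maxDist; i++) {
            BkTree child = children.get(i);
            if (child != null) child.query(q, maxDist, out);
        }
    }

    // Standard two-row Levenshtein DP.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }
}
```

Usage would be roughly: build the tree from your tokens with add(), then call query("cat", 2, results) to get every token within edit distance 2.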

However, if you matched the tokens to the string, you can construct a hypothetical perfect-match entity and query your semantic database for the nearest neighbours.

You would have to break the words apart into partial words when tokenizing to achieve partial matches.

However, you can also do wildcard matches (prefix, suffix, or both), but no fuzziness then.

You can also index the whole word or different concatenations of tokens.

However, there may be special BK-tree implementations that support this, but I have never seen one.

清晨说晚安 2024-10-05 04:26:07


I did a number of iterations of a spelling corrector ages ago, and here's a recent description of the basic method. Basically, the dictionary of correct words is in a trie, and the search is a simple branch-and-bound. I used a repeated depth-first trie walk, bounded by Levenshtein distance. Because each additional increment of distance results in much more of the trie being walked, the cost, for small distances, is basically exponential in the distance, so going to a combined depth/breadth search doesn't save much but makes it a lot more complicated.
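To illustrate, a sketch of that kind of bounded depth-first walk (the usual trie-plus-Levenshtein-row formulation, not necessarily my exact code) looks like this:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a branch-and-bound trie walk bounded by Levenshtein distance.
// Node/field names are illustrative.
class BoundedTrieSearch {

    static class Node {
        final Map<Character, Node> children = new HashMap<>();
        String word;  // set on terminal nodes
    }

    static List<String> search(Node root, String query, int maxDist) {
        List<String> results = new ArrayList<>();
        // First row of the edit-distance table: distance from "" to each query prefix.
        int[] firstRow = new int[query.length() + 1];
        for (int j = 0; j <= query.length(); j++) firstRow[j] = j;
        for (Map.Entry<Character, Node> e : root.children.entrySet()) {
            walk(e.getValue(), e.getKey(), query, firstRow, maxDist, results);
        }
        return results;
    }

    private static void walk(Node node, char c, String query, int[] prevRow,
                             int maxDist, List<String> results) {
        int cols = query.length() + 1;
        int[] row = new int[cols];
        row[0] = prevRow[0] + 1;
        int rowMin = row[0];
        for (int j = 1; j < cols; j++) {
            int cost = query.charAt(j - 1) == c ? 0 : 1;
            row[j] = Math.min(Math.min(row[j - 1] + 1, prevRow[j] + 1), prevRow[j - 1] + cost);
            rowMin = Math.min(rowMin, row[j]);
        }
        if (node.word != null && row[cols - 1] <= maxDist) {
            results.add(node.word);
        }
        // Bound: if every entry in this row already exceeds maxDist, nothing below
        // this node can come back within the bound, so prune the whole subtree.
        if (rowMin <= maxDist) {
            for (Map.Entry<Character, Node> e : node.children.entrySet()) {
                walk(e.getValue(), e.getKey(), query, row, maxDist, results);
            }
        }
    }
}
```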

(Aside: You'd be amazed how many ways physicians can try to spell "acetylsalicylic acid".)

I'm surprised at the size of your trie. A basic dictionary of acceptable words is maybe a few thousand. Then there are common prefixes and suffixes. Since the structure is a trie, you can connect together sub-tries and save a lot of space. Like the trie of basic prefixes can connect to the main dictionary, and then the terminal nodes of the main dictionary can connect to the trie of common suffixes (which can in fact contain cycles). In other words, the trie can be generalized into a finite state machine. That gives you a lot of flexibility.

REGARDLESS of all that, you have a performance problem. The nice thing about performance problems is, the worse they are, the easier they are to find. I've been a real pest on StackOverflow pointing this out. This link explains how to do it, links to a detailed example, and tries to dispel some popular myths. In a nutshell, the more time it is spending doing something that you could optimize, the more likely you will catch it doing that if you just pause it and take a look. My suspicion is that a lot of time is going into operations on overblown data structures, rather than just getting to the answer. That's a common situation, but don't fix anything until samples point you directly at the problem.
