高性能分层文本搜索
我现在正处于升级主要事务系统中的层次结构设计的最后阶段,我已经盯着这个 150 行查询一段时间了(我将免去您阅读的所有乏味内容)并认为那里有一定有更好的方法。
该问题的简要总结如下:
如何实现分层搜索,以匹配层次结构中不同级别的多个搜索词,并针对最快搜索时间进行优化?
我发现了一个有些相关的问题 ,但这实际上只占我实际需要的答案的 20% 左右。以下是完整的场景/规范:
- 最终目标是在层次结构中的任意位置找到一个或多个任意项目。
- 完整的层次结构大约有 80,000 个节点,预计几年内将增长到 100 万个。
- 整个沿着层次结构的路径的全文是唯一且具有描述性的;但是,单个节点的文本可能不是。这是商业现实,而不是轻易做出的决定。
- 示例:节点可能具有诸如“门”之类的名称,该名称本身毫无意义,但完整的上下文是“Aaron > House > Living Room > Liquor Cabinet > Door” ”,含义明确,描述了特定位置的特定门。 (请注意,这只是一个示例,真正的设计远没有那么简单)
- 为了找到这个特定的门,用户可能会输入“aaron 酒门”,这可能只会出现一个结果。该查询被翻译为一个序列:一个包含文本“door”的项目,在一个包含文本“liquor”的项目下,在另一个包含文本“aaron”的项目下。
- 或者,用户可能只需输入“house wine”即可列出人们家中的所有酒柜(这不是很好吗)。我明确提到这个示例是为了表明搜索不需要匹配任何特定的根或叶级别。该用户确切地知道他正在寻找哪扇门,但无法立即记住谁拥有这扇门,并且会记住该名称是否出现在他面前。
- 所有术语必须按指定的顺序匹配,但正如上面的示例所示,可以“跳过”层次结构中的级别。术语“aaron booze Cabinet”将不匹配此节点。
- 该平台是 SQL Server 2008,但我相信这是一个与平台无关的问题,并且不希望将答案限制在该平台上。
- 层次结构本身基于
hierarchyid
(物化路径),按广度优先和深度优先索引。每个层次结构节点/记录都有一个要查询的Name
列。基于节点的层次结构查询速度非常快,所以不用担心这些。 - 没有严格的层次结构 - 根可能根本不包含任何节点,也可能包含 30 个子树,扇形分布为 10,000 个叶节点。
- 最大嵌套层数是任意的,但实际上一般不会超过 4-8 层。
- 层次结构可以而且确实会发生变化,尽管这种情况很少发生。任何节点都可以移动到任何其他节点,但有明显的例外(父节点不能移动到自己的子节点等)。
- 如果这还没有暗示:我确实可以控制设计并可以添加索引,字段、表格,以及获得最佳结果可能需要的任何内容。
我的“梦想”是向用户提供即时反馈,就像在渐进式搜索/过滤器中一样,但我知道这可能是不可能的或极其困难。我很高兴对当前方法有任何重大改进,当前方法通常需要 0.5 秒到 1 秒,具体取决于结果的数量。
为了完整起见,现有查询(存储过程)首先收集包含最终术语的所有叶节点,然后向上连接并排除路径与先前术语不匹配的任何叶节点。如果这对任何人来说都显得落后,请放心,这比从根部开始并呈扇形展开要有效得多。这是“旧”方式,每次搜索很容易花费几秒钟的时间。
所以我的问题又来了:是否有更好(更有效)的方法来执行此搜索?
我不一定要寻找代码,而只是寻找方法。我考虑了几种可能性,但它们似乎都有一些问题:
- 创建一个分隔的“路径文本”列并使用全文搜索对其进行索引。问题在于,对该列的搜索也会返回所有子节点; “aaron house”还匹配“aaron house kitchen”和“aaron house地下室”。
- 使用 CLR 类型创建了一个
NamePath
列,该列实际上是一个嵌套的字符串序列,类似于hierarchyid
本身。问题是,我不知道 Microsoft 如何能够将这种类型的查询“翻译”为索引操作,而且我什至不确定是否可以在 UDT 上执行此操作。如果最终结果只是完整的索引扫描,那么我通过这种方法一无所获。
如果我不能做得比我已经拥有的更好,那也不是世界末日。搜索“相当快”,没有人对此抱怨。但我敢打赌,以前有人已经解决过这个问题并且有一些想法。请分享!
I'm now in the final stages of upgrading the hierarchy design in a major transactional system, and I have been staring for a while at this 150-line query (which I'll spare you all the tedium of reading) and thinking that there has got to be a better way.
A quick summary of the question is as follows:
How would you implement a hierarchical search that matches several search terms at different levels in the hierarchy, optimized for fastest search time?
I found a somewhat related question, but it's really only about 20% of the answer I actually need. Here is the full scenario/specification:
- The end goal is to find one or several arbitrary items at arbitrary positions in the hierarchy.
- The complete hierarchy is about 80,000 nodes, projected to grow up to 1M within a few years.
- The full text of an entire path down the hierarchy is unique and descriptive; however, the text of an individual node may not be. This is a business reality, and not a decision that was made lightly.
- Example: a node might have a name like "Door", which is meaningless by itself, but the full context, "Aaron > House > Living Room > Liquor Cabinet > Door", has clear meaning, it describes a specific door in a specific location. (Note that this is just an example, the real design is far less trivial)
- In order to find this specific door, a user might type "aaron liquor door", which would likely turn up only one result. The query is translated as a sequence: An item containing the text "door", under an item containing the text "liquor", under another item containing the text "aaron."
- Or, a user might just type "house liquor" to list all the liquor cabinets in people's houses (wouldn't that be nice). I mention this example explicitly to indicate that the search need not match any particular root or leaf level. This user knows exactly which door he is looking for, but can't remember offhand who owns it, and would remember if the name popped up in front of him.
- All terms must be matched in the specified sequence, but as the above examples suggest, levels in the hierarchy can be "skipped." The term "aaron booze cabinet" would not match this node.
- The platform is SQL Server 2008, but I believe that this is a platform-independent problem and would prefer not to restrict answers to that platform.
- The hierarchy itself is based on
hierarchyid
(materialized path), indexed both breadth-first and depth-first. Each hierarchy node/record has aName
column which is to be queried on. Hierarchy queries based on the node are extremely fast, so don't worry about those. - There is no strict hierarchy - a root may contain no nodes at all or may contain 30 subtrees fanning out to 10,000 leaf nodes.
- The maximum nesting is arbitrary, but in practice it tends to be no more than 4-8 levels.
- The hierarchy can and does change, although infrequently. Any node can be moved to any other node, with the obvious exceptions (parent can't be moved into its own child, etc.)
- In case this wasn't already implied: I do have control over the design and can add indexes, fields, tables, whatever might be necessary to get the best results.
My "dream" is to provide instant feedback to the user, as in a progressive search/filter, but I understand that this may be impossible or extremely difficult. I'd be happy with any significant improvement over the current method, which usually takes between 0.5s to 1s depending on the number of results.
For the sake of completeness, the existing query (stored procedure) starts by gathering all leaf nodes containing the final term, then joins upward and excludes any whose paths don't match with the earlier terms. If this seems backward to anyone, rest assured, it is a great deal more efficient than starting with the roots and fanning out. That was the "old" way and could easily take several seconds per search.
So my question again: Is there a better (more efficient) way to perform this search?
I'm not necessarily looking for code, just approaches. I have considered a few possibilities but they all seem to have some problems:
- Create a delimited "path text" column and index it with Full-Text Search. The trouble is that a search on this column would return all child nodes as well; "aaron house" also matches "aaron house kitchen" and "aaron house basement".
- Created a
NamePath
column that is actually a nested sequence of strings, using a CLR type, similar tohierarchyid
itself. Problem is, I have no idea how Microsoft is able to "translate" queries on this type to index operations, and I'm not even sure if it's possible to do on a UDT. If the net result is just a full index scan, I've gained nothing by this approach.
It's not really the end of the world if I can't do better than what I already have; the search is "pretty fast", nobody has complained about it. But I'm willing to bet that somebody has tackled this problem before and has some ideas. Please share them!
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
看看 Apache Lucene。您可以使用 Lucene 实现非常灵活且高效的搜索。它可能有用。
另请查看搜索模式 - 您所描述的内容可能适合分面搜索模式。
实现一个简单的“Aaron House Living Door”算法非常容易,但不确定基于常规 SVM/分类/熵的算法是否可以扩展到大型数据集。您可能还想了解 Motwani 和 Raghavan 的“近似搜索”概念。
如果可能的话,请发回您发现的内容:-)
take a look at Apache Lucene. You can implement very flexible yet efficient searches using Lucene. It may be useful.
Also take a look at the Search Patterns - What you are describing may fit into the Faceted Search pattern.
It is quite easy implement a trivial "Aaron House Living Door" algorithm, but not sure the regular SVM/classification/entropy based algorithms would scale to a large data set. You may also want to look at the "approximation search" concepts by Motwani and Raghavan.
Please post back what you find, if possible :-)
嗨,亚伦,我有以下想法:
根据你的描述,我脑海中浮现出以下画面:
这就是您的搜索树的样子。现在我将对每个级别的节点进行排序:
现在处理查询应该简单快捷:
设 M 为总体匹配数(与查询中最后一个单词匹配的节点数)。那么我们的处理时间就是: O( (log N)^2 + M * (log N) ):
二分查找每个级别需要 O(log N) 时间,并且有 O(log N) 个级别,因此我们必须花费至少 O( (log N)^2 ) 时间。现在,对于每个匹配,我们必须测试从匹配节点到根的完整路径是否与完整查询匹配。路径的长度为 O(log N)。因此,给定 M 总体匹配,我们又花费 M * O(log N) 时间,因此最终的执行时间为 O( (log N)^2 + M * (log N) )。
当匹配项很少时,处理时间接近 O( (log N)^2 ),这非常好。相反,如果发生最坏的情况(每条路径都匹配查询(M = N)),则处理时间接近 O(N log N),这不太好,但也不太可能。
实施:
你说,你只想要一个想法。另外我对数据库的了解非常有限,所以我不会在这里写太多,只是概述一些想法。
节点表可能如下所示:
该表必须按“文本”列排序。使用上述算法,循环内的 SQL 查询可能如下所示:
从节点中选择 ID,其中 Level = $i AND Text LIKE $text
希望有人能明白我的观点。
人们不仅可以通过“文本”列对表进行排序,还可以通过组合“文本”和“级别”列对表进行排序,即对 Level=20 内的所有条目进行排序,对 Level=20 内的所有条目进行排序,从而进一步加快处理速度。 19 排序等(但不需要对整个表进行整体排序)。然而,每个级别的节点数为 O(N),因此没有渐近运行时改进,但考虑到您在现实中获得的较低常数,我认为值得尝试。
编辑:改进
我刚刚注意到,迭代算法是完全不必要的(因此可以放弃级别信息)。完全足以:
这将运行时间提高到 O(log N + M * (log N))。
Hi Aarron, I have the following idea:
From your description I have the following image in my mind:
This is how your search tree might look like. Now I would sort the nodes on every level:
Now it should be easy and fast to process a query:
Let be M the number of overall matches (number of nodes matching the last word in the query). Then our processing time is: O( (log N)^2 + M * (log N) ):
Binary search takes O(log N) time per level and there are O(log N) levels, therefore we have to spend at least O( (log N)^2 ) time. Now, for every match, we have to test, whether the complete path from our matching node up to the root matches the complete query. The path has length O(log N). Thus, given M matches overall, we spend another M * O(log N) time, thus the resulting execution time is O( (log N)^2 + M * (log N) ).
When You have few matches, the processing time approaches O( (log N)^2 ), which is pretty good. On the opposite if the worst case occurs (every single path matches the query (M = N)), the processing time approaches O(N log N) which is not too good, but also not too likely.
Implementation:
You said, that You only wanted an idea. Further my knowledge on databases is very limited, so I won't write much here, just outline some ideas.
The node table could look like this:
This table would have to be sorted by the "Text" column. Using the algorithm described above a sql query inside the loop might look like:
SELECT ID FROM node WHERE Level = $i AND Text LIKE $text
Hope one can get my point.
One could speed things even more up, by not only sorting the table by the "Text" column, but by the combined "Text" and "Level" columns, that is, all entries within Level=20 sorted, all entries within Level=19 sorted etc.(but no overall sorting over the complete table necessary). However, the node count PER LEVEL is in O(N), so there is no asymptotic runtime improvement, but I think it's worth to try out, considering the lower constants You get in reality.
Edit: Improvement
I just noticed, that the iterative algorithm is completely unnecessary(thus the Level information can be abandoned). It is fully sufficient to:
This improves the runtime to O(log N + M * (log N)).