Combining Hibernate Search results with a relational database query
I have a complex query that requires a full-text search on some fields and basic restrictions on other fields. The Hibernate Search documentation strongly advises against adding database query restrictions to a full-text search query and instead recommends putting all of the necessary fields into the full-text index. My problem with that is that the other fields are volatile: values can change every minute or so, and those updates to the database may occur outside of the JVM doing the search, so there is a high likelihood that the local Lucene index would be out of date with respect to those fields.
I'm looking for strategy recommendations. The best I've come up with so far is to join the results manually: first execute the database query (fetching only object IDs), then execute the full-text search, and somehow efficiently filter the Lucene results by the set of object IDs from the database. Of course, I don't know how many results I'll get from each separate query, so I'm worried about performance and memory. In the worst case, each query could return tens of thousands of rows.
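The manual join described above could be sketched roughly as follows. This is only an illustration in plain Java, not Hibernate Search API: the class `ResultJoiner`, the method `join`, and its parameters are hypothetical names, and it assumes the database IDs fit in an in-memory `Set` and that the full-text hits arrive as a list of IDs in relevance order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: not part of Hibernate Search or Lucene.
class ResultJoiner {
    /**
     * Keeps only the full-text hits whose ID also passed the database
     * restrictions, preserving relevance order. Stops as soon as one page
     * is filled, so the whole hit list is not always scanned.
     */
    static List<Long> join(List<Long> rankedHitIds, Set<Long> allowedIds,
                           int offset, int pageSize) {
        List<Long> page = new ArrayList<>();
        if (pageSize <= 0) {
            return page;
        }
        int matched = 0;
        for (Long id : rankedHitIds) {
            if (!allowedIds.contains(id)) {
                continue; // rejected by the database restrictions
            }
            if (matched++ < offset) {
                continue; // belongs to an earlier page
            }
            page.add(id);
            if (page.size() == pageSize) {
                break;
            }
        }
        return page;
    }

    public static void main(String[] args) {
        // IDs from the full-text search, best match first (made-up values).
        List<Long> hits = List.of(42L, 7L, 13L, 99L, 5L);
        // IDs returned by the database query (made-up values).
        Set<Long> allowed = new HashSet<>(List.of(5L, 7L, 99L));
        System.out.println(join(hits, allowed, 0, 2)); // prints [7, 99]
    }
}
```

A `HashSet` of tens of thousands of `Long` values is usually small enough to hold in memory, but the database query still has to run to completion before filtering can start, which is the performance worry raised above.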
Comments (3)
I am quite interested in other ideas for this, as we have a very similar scenario.
We only needed to show a maximum of 50 result rows, with a couple of lookups per row. We run the query against the Lucene index, which stores the database primary key IDs, and then pull the lookups out of the database per row. It's still performant for us.
As you seem to want to process more than a few rows and lookups, I did consider an alternative: timestamp any database row updates. This would allow us to query the database for rows whose index entries are stale and then iteratively reindex the related documents.
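A minimal sketch of the timestamp idea, under stated assumptions: the class `StaleIndexFinder` and the method `staleIds` are hypothetical, and the `Map` stands in for what would really be a database query along the lines of "rows whose last-modified column is later than the previous index run". The actual reindexing call is left out, since it depends on the Hibernate Search version in use.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper: the map stands in for a "last_modified > ?" query.
class StaleIndexFinder {
    /**
     * Returns the IDs of rows modified after the last successful index run.
     * Only these rows need to be reindexed.
     */
    static List<Long> staleIds(Map<Long, Instant> lastModifiedById,
                               Instant lastIndexedAt) {
        List<Long> stale = new ArrayList<>();
        for (Map.Entry<Long, Instant> row : lastModifiedById.entrySet()) {
            if (row.getValue().isAfter(lastIndexedAt)) {
                stale.add(row.getKey());
            }
        }
        Collections.sort(stale); // deterministic order for batching
        return stale;
    }

    public static void main(String[] args) {
        Instant lastRun = Instant.parse("2024-01-01T12:00:00Z");
        Map<Long, Instant> rows = Map.of(
                1L, Instant.parse("2024-01-01T11:59:00Z"),  // indexed already
                2L, Instant.parse("2024-01-01T12:00:30Z"),  // changed since
                3L, Instant.parse("2024-01-01T12:05:00Z")); // changed since
        System.out.println(staleIds(rows, lastRun)); // prints [2, 3]
    }
}
```

The index can still lag by up to one polling interval, so this narrows the staleness window rather than removing it.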
I have the same problem and run separate Lucene and criteria queries. If I run the criteria query first, I use the resulting IDs to apply a custom IdFilter to the Lucene search, which checks whether each result is in the ID collection from the first query. However, this approach does not scale well: in my case the number of results from the first query can be huge, and the filter is limited to 1024 IDs. I did not find a good solution, but I change the order of my two queries depending on the expected number of results. The first query should be the one that filters out most of the results.
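One way to live with a per-filter cap like the 1024 IDs mentioned above is to split the ID list into batches and run the second query once per batch. The sketch below is plain Java with hypothetical names (`IdBatcher`, `chunk`, `databaseFirst`); it is not the IdFilter from this answer, and the row-count estimates passed to `databaseFirst` are assumed to come from somewhere cheap, such as count queries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: not part of Hibernate Search or Lucene.
class IdBatcher {
    /** Splits ids into consecutive batches of at most maxPerChunk elements. */
    static <T> List<List<T>> chunk(List<T> ids, int maxPerChunk) {
        if (maxPerChunk <= 0) {
            throw new IllegalArgumentException("maxPerChunk must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += maxPerChunk) {
            int to = Math.min(from + maxPerChunk, ids.size());
            chunks.add(new ArrayList<>(ids.subList(from, to))); // copy the view
        }
        return chunks;
    }

    /** Run the more selective query first: true if the database query is expected to return fewer rows. */
    static boolean databaseFirst(long expectedDbRows, long expectedFullTextHits) {
        return expectedDbRows <= expectedFullTextHits;
    }

    public static void main(String[] args) {
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i < 2500; i++) {
            ids.add(i);
        }
        List<List<Long>> batches = chunk(ids, 1024);
        System.out.println(batches.size());        // prints 3
        System.out.println(batches.get(2).size()); // prints 452
    }
}
```

Batching keeps each filter under the cap, but results from separate batches lose their global relevance ordering unless they are merged and re-sorted by score afterwards.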
You can run a scheduled index update based on the last-modified date.