How do I implement grouping in SolrNet / Solr (Lucene)?

Posted on 2024-09-13 23:51:31

I have a Lucene index keyed by pageId (the unique key), and one document can have multiple pages. Once a user performs a search, we get back the pages that match the search criteria.

I am using Lucene.Net 2.9.2

We have two problems:

1- The index is around 800 GB and holds 130 million rows (pages), so search was really slow: every query took more than a minute, even though we only need to return a limited number of rows at a time.

To overcome the performance issue I shifted to Solr, which resolved it (which is quite strange, as I am not using any extra functionality Solr provides, such as sharding, so could it be that Lucene.NET 2.9.2 is not really equivalent in performance to the same version in Java?). But now I am having another issue...

2- Each individual 'Lucene document' is one page, but I want to show results grouped by 'real documents'. How many results are returned should be configurable in terms of 'real documents', not 'pages', because that is how I want to present them to the user.

So let's say I want 20 'real documents' and ALL the pages in them that match the search criteria (it doesn't matter if one document has 100 matching pages and another just 1).
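
(For reference, this is essentially what Solr's later built-in result grouping expresses: with grouping on, 'rows' counts groups rather than individual documents, and 'group.limit' caps the pages returned per group. A minimal SolrJ sketch, assuming a hypothetical 'realDocId' field on every indexed page and a grouping-capable Solr/SolrJ version, i.e. newer than the 1.4.1 discussed here:)

```java
// SolrJ sketch of a grouped query: 20 "real documents", each with its matching pages.
// "realDocId" and "contents" are assumed field names; HttpSolrServer is the
// SolrJ 3.x/4.x-era client class.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.Group;
import org.apache.solr.client.solrj.response.GroupCommand;
import org.apache.solr.client.solrj.response.QueryResponse;

public class GroupedSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("contents:invoice");
        q.set("group", true);               // turn result grouping on
        q.set("group.field", "realDocId");  // one group per "real document"
        q.set("group.limit", 100);          // cap on matching pages returned per real document
        q.setRows(20);                      // with grouping, rows counts groups, i.e. 20 real documents

        QueryResponse rsp = solr.query(q);
        for (GroupCommand cmd : rsp.getGroupResponse().getValues()) {
            for (Group g : cmd.getValues()) {
                System.out.println("real document " + g.getGroupValue()
                        + ": " + g.getResult().size() + " matching pages returned");
            }
        }
    }
}
```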

From what I could gather on the Solr forums, this can be achieved with the SOLR-236 patch (field collapsing), but I have not been able to apply the patch cleanly against trunk (it produces lots of errors).

This is really important for me and I don't have much time, so could someone please either send me a Solr 1.4.1 binary with this patch applied, or point me to another way of achieving this?

I would really appreciate it. Thanks!!


Comments (3)

绝不服输 2024-09-20 23:51:31

If you have issues with the collapse patch, then the Solr issue tracker is the channel to report them. I can see that other people are currently having some issues with it, so I suggest getting involved in its development.

That said: if your application needs to search for 'real documents', I recommend building your index around those 'real documents' rather than their individual pages.
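
A sketch of that indexing approach, assuming hypothetical schema fields 'id', 'contents' and a stored 'pageStartOffsets' field: pages are concatenated into one Solr document per 'real document', and the word offset at which each page starts is kept so page numbers can still be recovered later.

```java
// Sketch of indexing one Solr document per "real document" instead of one per page.
// "id", "contents" and "pageStartOffsets" are assumed schema fields; the offsets
// field is stored (not searched) so the matching page can be recovered later.
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RealDocumentIndexer {

    public static void indexRealDocument(HttpSolrServer solr,
                                         String realDocId,
                                         List<String> pages) throws Exception {
        StringBuilder contents = new StringBuilder();
        StringBuilder pageStarts = new StringBuilder();

        int wordOffset = 0;
        for (String page : pages) {
            if (pageStarts.length() > 0) {
                pageStarts.append(',');
            }
            pageStarts.append(wordOffset);           // word offset where this page begins
            contents.append(page).append(' ');
            // rough whitespace word count; in practice use the same analysis as the index
            wordOffset += page.split("\\s+").length;
        }

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", realDocId);
        doc.addField("contents", contents.toString());
        doc.addField("pageStartOffsets", pageStarts.toString());
        solr.add(doc);   // commit separately, e.g. solr.commit() after a batch
    }
}
```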

If your only requirement is to show page numbers, I would suggest playing with the highlighter or doing some custom development. You can store the word offsets of the start and end of each page in a custom structure, and knowing the position of the matched word within the whole document, you can tell which page it appears on. If the documents are very large you will get a good performance improvement.
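
A small sketch of that lookup, assuming the page-start word offsets are available (for example parsed from a stored field like the 'pageStartOffsets' above) and that the matched word's position in the whole document is known, e.g. from term positions or the highlighter:

```java
// Map a matched word position back to the page it falls on, given the word
// offsets at which each page starts. Names here are illustrative only.
import java.util.Arrays;

public class PageLocator {

    /**
     * @param pageStartOffsets word offset at which each page begins, in ascending order
     * @param matchedWordPosition position of the matched word within the whole document
     * @return 1-based page number containing the match
     */
    public static int pageOf(int[] pageStartOffsets, int matchedWordPosition) {
        int idx = Arrays.binarySearch(pageStartOffsets, matchedWordPosition);
        if (idx >= 0) {
            return idx + 1;   // match sits exactly on a page boundary
        }
        return -idx - 1;      // (insertion point - 1) converted to a 1-based page number
    }

    public static void main(String[] args) {
        int[] starts = {0, 250, 730, 1200};      // a four-page document
        System.out.println(pageOf(starts, 900)); // -> 3 (falls inside the third page)
    }
}
```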

意中人 2024-09-20 23:51:31

You could also have a look at SOLR-1682: Implement CollapseComponent. I haven't tested it yet, but as far as I know it solves the collapsing problem too.
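
Not the SOLR-1682 patch itself, but worth noting: considerably later Solr releases ship field collapsing built in (the CollapsingQParserPlugin plus the ExpandComponent). A hedged sketch of such a query, again assuming a 'realDocId' field on each page:

```java
// Sketch of a collapsed query using the later built-in CollapsingQParserPlugin
// and ExpandComponent (Solr 4.x and newer), not the SOLR-1682 patch. "realDocId"
// is an assumed field identifying the parent document of each page.
import org.apache.solr.client.solrj.SolrQuery;

public class CollapsedQuery {

    public static SolrQuery build(String userQuery) {
        SolrQuery q = new SolrQuery(userQuery);
        q.addFilterQuery("{!collapse field=realDocId}"); // keep one representative page per real document
        q.set("expand", true);      // also return the other matching pages of each collapsed group
        q.set("expand.rows", 100);  // cap on expanded pages per real document
        q.setRows(20);              // 20 collapsed hits = 20 real documents
        return q;
    }
}
```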
