Optimizing Lucid/Solr for indexing large text documents
I am trying to index about 3 million text documents in Solr. About 1/3 of these files are emails that contain roughly 1-5 paragraphs of text. The remaining 2/3 of the files only have a few words to a few sentences each.
It takes Lucid/Solr nearly 1 hour to fully index the entire dataset I'm working with, and I'm trying to find ways to optimize this. I have set up Lucid/Solr to commit only every 100,000 documents, and it indexes the documents in batches of 50,000 at a time. Memory is no longer an issue; because of the batching it stays consistently around 1 GB.
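For reference, a batched SolrJ indexing loop along the lines described above might look like the sketch below. It is only a minimal sketch: the core URL, the batch and commit sizes, and the loadNextBatch helper are hypothetical stand-ins for the actual setup, and a recent SolrJ API is assumed.

```java
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical core URL; substitute the real Lucid/Solr endpoint.
        SolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/legacy_docs").build();

        long sinceLastCommit = 0;
        List<SolrInputDocument> batch;
        // loadNextBatch(...) is a placeholder for however the legacy data is read.
        while ((batch = loadNextBatch(50_000)) != null) {
            client.add(batch);                 // send one batch of documents per request
            sinceLastCommit += batch.size();
            if (sinceLastCommit >= 100_000) {  // commit only every 100,000 documents
                client.commit();
                sinceLastCommit = 0;
            }
        }
        client.commit();                       // final commit for any remainder
        client.close();
    }

    private static List<SolrInputDocument> loadNextBatch(int size) {
        // Placeholder: build SolrInputDocuments from the legacy source here,
        // returning null when the data is exhausted.
        return null;
    }
}
```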
The entire dataset has to be indexed up front. It's like a legacy system being loaded into a new system, so the data has to be indexed and it needs to happen as fast as possible, but I'm not sure which areas to look into to optimize this time.
I'm thinking that maybe there are a lot of little words like "the, a, because, should, if, ..." that cause a lot of overhead and are just "noise" words. I am curious whether cutting them out would drastically speed up the indexing time. I have been looking at the Lucid docs for a while, but I can't seem to find a way to specify which words not to index. I came across the term "stop list" but didn't see much more than a passing reference to it.
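For reference, a "stop list" is simply a set of stop words that the analyzer drops before anything reaches the index; in Solr it is normally configured as a stop filter (backed by a stopwords file) in the field type's analysis chain. The sketch below only shows the effect at the Lucene level; the word list and field name are just examples, and class packages vary somewhat between Lucene versions.

```java
import java.util.Arrays;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StopWordDemo {
    public static void main(String[] args) throws Exception {
        // Example stop list; Solr would normally load this from a stopwords file.
        CharArraySet stopWords = new CharArraySet(
                Arrays.asList("the", "a", "because", "should", "if"), true /* ignoreCase */);
        Analyzer analyzer = new StandardAnalyzer(stopWords);

        // Tokens matching the stop list are dropped before they would reach the index.
        try (TokenStream ts = analyzer.tokenStream("body",
                "the quick fox should hide if a hound appears")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // quick, fox, hide, hound, appears
            }
            ts.end();
        }
        analyzer.close();
    }
}
```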
Are there other ways to make this indexing go faster, or am I just stuck with a 1-hour indexing time?
2 Answers
We ran into a similar problem recently. We couldn't use SolrJ because the requests and responses have to go through some other applications, so we took the following steps:
1. Create a custom Solr type to stream large text fields.
2. Reduce memory usage on the client side:
2.1 Use a streaming API (Gson streaming or XML StAX) to read the documents one by one.
2.2 Define a custom Solr field type, FileTextField, which accepts a FileHolder as its value. FileTextField eventually passes a Reader to Lucene, and Lucene uses that Reader to read the content and add it to the index (see the sketch after this list).
2.3 When a text field is too big, first uncompress it to a temp file, create a FileHolder instance, then set that FileHolder instance as the field value.
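FileTextField and FileHolder above are the answerer's own custom classes, so the sketch below is only an illustration of the underlying idea using plain Lucene: a field can take a Reader instead of an in-memory String, so large text can stream from the temp file into the index. The paths, field names, and document id are hypothetical.

```java
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class ReaderFieldSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical locations for the index and the uncompressed temp file.
        Path indexDir = Paths.get("/tmp/index");
        Path tempFile = Paths.get("/tmp/large-email-body.txt");

        try (FSDirectory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(
                     dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            doc.add(new StringField("id", "email-42", Field.Store.YES));

            // The body streams from disk via a Reader rather than being held
            // in memory as one big String (the idea behind FileTextField).
            Reader body = Files.newBufferedReader(tempFile);
            doc.add(new TextField("body", body));   // Reader-backed, not stored

            writer.addDocument(doc);                // Lucene consumes the Reader here
        }
    }
}
```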
It seems from your question that indexing time is really important for your application. Solr is a great search engine, but if you need super-fast indexing and that is a very important criterion for you, then you should go with the Sphinx search engine. It won't take you much time to quickly set up Sphinx and benchmark your results.
There can be ways to optimize (like the one you mentioned, stop words, etc.), but whatever you do with respect to indexing time, Solr won't be able to beat Sphinx. I have done the benchmarking myself.
I too love Solr for its ease of use and its great out-of-the-box features such as n-gram indexing, faceting, multi-core support, spelling correctors, and its integration with other Apache products, but when it comes to optimized algorithms (be it index size, indexing time, etc.), Sphinx rocks!
Sphinx is also open source. Try it out.