Does it make sense to use Hadoop for the import and Solr to provide the web interface?
I'm looking at the need to import a lot of data into a Lucene index in real time. This will consist of files of various formats (Doc, Docx, Pdf, etc.).
The data will be imported as batches of compressed files, so they will need to be decompressed, indexed as individual files, and somehow related to the batch as a whole.
I'm still trying to figure out how to accomplish this, but I think I can use Hadoop for the processing and import into Lucene. I can then use Solr as a web interface.
Am I overcomplicating things, since Solr can already process data? Because the CPU load for import is very high (due to preprocessing), I believe I need to separate importing from casual searching regardless of the implementation.
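A rough sketch of the decompress-and-index-with-a-batch-id step described above, assuming SolrJ, plain-text zip entries, and placeholder field names (id, batch_id, filename, content) rather than any real schema:

```java
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.UUID;
import java.util.zip.ZipFile;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // Placeholder core URL -- adjust to the actual deployment.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mail").build();

        // One shared id ties every extracted file back to the batch it came from.
        String batchId = UUID.randomUUID().toString();

        try (ZipFile zip = new ZipFile(args[0])) {
            zip.stream().filter(e -> !e.isDirectory()).forEach(entry -> {
                try (InputStream in = zip.getInputStream(entry)) {
                    // Placeholder: real code would run Tika or similar extraction here
                    // instead of assuming the entry is plain UTF-8 text.
                    String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);

                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", batchId + "/" + entry.getName()); // placeholder field names
                    doc.addField("batch_id", batchId);
                    doc.addField("filename", entry.getName());
                    doc.addField("content", text);
                    solr.add(doc);
                } catch (Exception ex) {
                    throw new RuntimeException(ex);
                }
            });
            solr.commit();
        }
        solr.close();
    }
}
```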
Q: "Please define a lot of data and realtime"
"A lot" of data is 1 billion email messages per year (or more), with an average size of 1 KB, and attachments ranging from 1 KB to 20 MB, with a small amount ranging from 20 MB to 200 MB. These are typically the attachments, referenced above, that need indexing.
Real time means a message is searchable within 30 minutes or less of being ready for import.
SLA:
I'd like to provide an SLA of 15 seconds or less for search operations.
3 Answers
If you need the processing done in real-time (or near real-time for that matter) then Hadoop may not be the best choice for you.
Solr already handles all aspects of processing and indexing the files. I would stick with a Solr-only solution first. Solr allows you to scale to multiple machines, so if you find that the CPU load is too high because of the processing, then you can easily add more machines to handle the load.
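If the "Solr already handles processing and indexing" part refers to Solr Cell (the /update/extract handler backed by Tika), a minimal SolrJ sketch might look like this, with a hypothetical core name and literal.* fields:

```java
import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractAndIndex {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mail").build();

        // /update/extract is the extracting request handler (Solr Cell, Tika under the hood).
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File(args[0]), "application/octet-stream");

        // literal.* parameters attach our own fields alongside the extracted text.
        req.setParam("literal.id", args[0]);          // placeholder field names
        req.setParam("literal.batch_id", "batch-42"); // ties the document to its batch
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

        solr.request(req);
        solr.close();
    }
}
```

The literal.* parameters are how per-document metadata such as a batch id rides along with the text that Solr extracts from the file.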
I suggest you use Solr Replication to ease the load, by indexing on one machine and serving searches from the others. Hadoop is not suitable for real-time processing.
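From the client side, that split could look roughly like the sketch below, assuming replication is already configured in solrconfig.xml on both nodes (solr.ReplicationHandler with replicateAfter=commit on the master and masterUrl/pollInterval on the slaves); host, core, and field names are placeholders:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class SplitIndexAndSearch {
    public static void main(String[] args) throws Exception {
        // All writes go to the indexing node; replication copies the resulting
        // segments to the search replicas (host names are placeholders).
        SolrClient indexer = new HttpSolrClient.Builder("http://index-master:8983/solr/mail").build();
        SolrClient searcher = new HttpSolrClient.Builder("http://search-replica:8983/solr/mail").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "msg-1");            // placeholder fields
        doc.addField("content", "hello world");
        indexer.add(doc);
        indexer.commit(); // with replicateAfter=commit, this is what triggers replication

        // Queries never touch the CPU-heavy indexing box.
        SolrQuery q = new SolrQuery("content:hello");
        System.out.println(searcher.query(q).getResults().getNumFound());

        indexer.close();
        searcher.close();
    }
}
```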
1 billion documents per year translates to approximately 32 documents per second, if spread uniformly.
You could run text extraction on a separate machine and send the indexable text to Solr. I suppose that, at this scale, you have to go for multi-core Solr, so you can send indexable content to different cores. That should speed up indexing.
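One way the "send indexable content to different cores" routing could look, round-robining documents over a few placeholder core URLs with SolrJ:

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class MultiCoreRouter {
    // ~1 billion docs/year is ~32 docs/sec on average; a handful of cores
    // keeps each individual index at a manageable size.
    private final List<SolrClient> cores;
    private final AtomicLong counter = new AtomicLong();

    public MultiCoreRouter(List<String> coreUrls) {
        this.cores = coreUrls.stream()
                .map(url -> (SolrClient) new HttpSolrClient.Builder(url).build())
                .collect(Collectors.toList());
    }

    /** Round-robin each document onto the next core. */
    public void index(SolrInputDocument doc) throws Exception {
        SolrClient target = cores.get((int) (counter.getAndIncrement() % cores.size()));
        target.add(doc);
    }
}
```

Searches would then need to fan out across the cores, e.g. with Solr's distributed search (the shards parameter), or by aggregating results yourself.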
I have indexed small structured documents in the 100-million range on a single core without much trouble. You should be able to scale to a few hundred million documents with a single Solr instance. (The text extraction service could run on another machine.)
Read about large-scale search on Hathi Trust's blog for the various challenges and solutions. They use Lucene/Solr.