Hardware requirements for a Lucene (Solr/Zoie/Elasticsearch) setup
I am working on a project where we are trying to introduce a search framework. We are about to start development; so far we have only done some proof-of-concept work. We are struggling with hardware estimates. I am uncertain whether our performance requirements can be met with a single-server setup, or whether we need a replicated or distributed solution.
Here are our main requirements:
- Search in semi-structured data
- Documents contain 15 fields, all of which should be searchable
  - Mostly numeric IDs
  - Dates
  - Names
- 10+ million documents in the index
- 30-40 updates, in batches, every minute
- <100 ms response time for searches with several boolean operators, at 100+ queries per minute
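To put the stated load in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the figures from the list above. The assumption that every query uses the full 100 ms budget on a single search thread is not from the requirements; it is chosen here as a pessimistic case.

```python
# Rough load estimate from the requirements above.
QUERIES_PER_MINUTE = 100      # "100+ queries per minute"
UPDATES_PER_MINUTE = 40       # upper end of "30-40 updates"
LATENCY_BUDGET_S = 0.100      # "<100 ms" per search

queries_per_second = QUERIES_PER_MINUTE / 60
updates_per_second = UPDATES_PER_MINUTE / 60

# Assumption: each query uses the whole latency budget on one thread.
busy_fraction = queries_per_second * LATENCY_BUDGET_S

print(f"queries/s: {queries_per_second:.2f}")
print(f"updates/s: {updates_per_second:.2f}")
print(f"one search thread busy: {busy_fraction:.0%}")
```

Under these assumptions the raw query rate is modest (under 2 queries per second), so the open question is mainly whether latency stays below 100 ms while updates are being applied.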
Questions
1) Is it feasible to get this performance on a single-server setup?
2) If not, what is an appropriate setup to meet the performance requirements?
3) We are considering several frameworks on top of Lucene, among them Solr and Zoie. What distributed architecture would be necessary to handle the described load and performance requirements?
Comments (2)
Yes, I think so, but it is a borderline case (I hope you know what I mean).
What you need is enough RAM and CPU power. In the end it depends on the size of your "big" fields, such as full-text fields, and on the size of your database.
For comparison, I use Lucene with 1.2 million documents and 7 fields, mostly short ones (dates, numbers, ...) but also including one big text field (500-5000 characters). The MySQL database that Lucene indexes is 1-2 GB in size. The system runs on a small single-CPU VMware host with 4 GB of RAM. Full-text search results are returned in 100-400 ms.
If you don't have big text fields, your results will return faster (depending on the kind of search, for example faceted search).
For example, a faceted search on a char(255) field returned in <70 ms.
For your configuration, non-virtualized hardware with lots of memory (>32 GB) and more than 8 cores would probably be useful.
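As a rough sanity check on the memory figure, the comparison numbers in this answer can be scaled up to the 10+ million documents from the question. This is only a sketch: linear scaling with document count is an assumption, and the function name is made up for illustration.

```python
# Figures quoted in this answer for the 1.2M-document setup.
REFERENCE_DOCS = 1_200_000
REFERENCE_DATA_GB = (1.0, 2.0)   # "1-2 GB" MySQL database

TARGET_DOCS = 10_000_000         # "10+ million" from the question


def scale_linearly(size_gb: float, from_docs: int, to_docs: int) -> float:
    """Assumption: data size grows in proportion to document count."""
    return size_gb * to_docs / from_docs


low, high = (
    scale_linearly(gb, REFERENCE_DOCS, TARGET_DOCS) for gb in REFERENCE_DATA_GB
)
print(f"estimated source data: {low:.1f}-{high:.1f} GB")
```

That range describes the source data, not the Lucene index, and the documents in the question have no large text field, so it is probably an overestimate. Even so, it suggests that more than 32 GB of RAM leaves comfortable headroom.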
Does that mean 30-40 new documents per minute? That's no problem!
30-40 updates per minute, each with lots of new documents, would be more challenging.
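One common way to keep frequent updates cheap is to collect them and apply them as a single batch with one commit, instead of committing per document. The sketch below shows only the buffering part. `UpdateBatcher` and its `send_batch` callback are hypothetical names for illustration, not part of Lucene or Solr.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]


class UpdateBatcher:
    """Collects document updates and hands them over in one batch.

    `send_batch` is a hypothetical callback that would index the
    documents and commit once; it is not a Lucene or Solr API.
    """

    def __init__(self, send_batch: Callable[[List[Document]], None]) -> None:
        self._send_batch = send_batch
        self._pending: List[Document] = []

    def add(self, doc: Document) -> None:
        self._pending.append(doc)

    def flush(self) -> int:
        """Send everything collected so far; return the batch size."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, []
        self._send_batch(batch)
        return len(batch)


# Usage: in a real system flush() would be called from a timer,
# for example once per minute. Here the batches are just recorded.
sent: List[List[Document]] = []
batcher = UpdateBatcher(sent.append)
for i in range(35):
    batcher.add({"id": i, "name": f"doc-{i}"})
flushed = batcher.flush()
```

With this shape, the number of commits per minute is set by the flush interval rather than by the number of incoming documents.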
Additionally, you should optimize your index periodically (for example nightly).
Solr runs as a Tomcat application. There you have to define, for example, how much RAM (see above) is assigned to your search engine.
There are different ways to split your index (for better performance or faster updates); clustering is also possible.
For a single-server setup with these requirements, you should look at something like ElasticSearch, because it is very well optimized for near-real-time updates.
With Solr you can get similar performance, but Solr has problems on a single node precisely when search and update requests are mixed. By splitting it across 2 or more nodes (master/slave) you will get performance similar to ElasticSearch, but not on a single node.
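The master/slave split can be sketched from the client side as follows: updates go to one node, searches are spread over the others. This is a minimal illustration. The class name and node URLs are placeholders, and replication from the master to the slaves is assumed to be handled by the search server itself.

```python
from itertools import cycle
from typing import Iterator, List


class MasterSlaveRouter:
    """Sends updates to one master and spreads searches over slaves.

    The node URLs are placeholders; replication from master to the
    slaves is assumed to be handled by the search server itself.
    """

    def __init__(self, master: str, slaves: List[str]) -> None:
        if not slaves:
            raise ValueError("at least one slave is required")
        self._master = master
        self._slaves: Iterator[str] = cycle(slaves)

    def node_for_update(self) -> str:
        return self._master

    def node_for_search(self) -> str:
        # Round-robin over the slaves.
        return next(self._slaves)


router = MasterSlaveRouter(
    master="http://master.example:8983/solr",
    slaves=["http://slave1.example:8983/solr", "http://slave2.example:8983/solr"],
)
```

Because searches never hit the node that is applying updates, the mixed read/write load described above is avoided, at the cost of a replication delay before new documents become searchable.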
See this post for more detail: http://blog.socialcast.com/realtime-search-solr-vs-elasticsearch/