Practical limits of ElasticSearch with Cassandra
I am planning on using ElasticSearch to index my Cassandra database. I am wondering if anyone has seen the practical limits of ElasticSearch. Do things get slow in the petabyte range? Also, has anyone had any problems using ElasticSearch to index Cassandra?
4 Answers
See this thread from 2011, which mentions ElasticSearch configurations with 1700 shards of 200GB each, which would be in the 1/3-petabyte range. I would expect the architecture of ElasticSearch to support almost limitless horizontal scalability, because each shard index works separately from all other shards.
The practical limits (which would apply to any other solution as well) include the time needed to actually load that much data in the first place. Managing a Cassandra cluster (or any other distributed datastore) of that size will also involve significant workload just for maintenance, load balancing, and so on.
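As a rough sketch of where those shard counts come from (the index name and numbers here are hypothetical, not taken from the thread): the shard count is fixed when an index is created, so deployments at this scale have to size their shards up front:

$ curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}'

The number_of_shards setting cannot be changed after the index is created; number_of_replicas can be adjusted later.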
Sonian is the company kimchy alludes to in that thread. We have over a petabyte on AWS across multiple ES clusters. There isn't a technical limitation to how far you can scale ES horizontally, but as DNA mentioned there are practical problems. The biggest by far is the network, and this applies to every distributed data store: you can only move so much across the wire at a time. When ES has to recover from a failure, it has to move data. The best option is to use smaller shards across more nodes (more concurrent transfer), but you risk a higher rate of failure and an exorbitant cost per byte.
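One concrete knob for the network bottleneck described above is Elasticsearch's recovery throttle, which caps how much bandwidth shard recovery may consume per node. A minimal sketch (the 50mb value is an arbitrary illustration, not Sonian's actual setting):

$ curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "50mb"
  }
}'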
As DNA mentioned, there were 1700 shards, but not 1700 shards of a single index: there were 1700 indexes, each with 1 shard and 1 replica. So it is quite possible that these 1700 indexes were not present on a single machine but were spread across multiple machines.
So this is not a problem in itself.
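If you want to verify how such indexes are spread out, the _cat/shards API lists every shard together with the node it is allocated on (assuming a cluster reachable on localhost):

$ curl -XGET 'http://localhost:9200/_cat/shards?v'

The node column in the output shows which machine holds each shard.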
I am currently starting to work with Elassandra (Elasticsearch + Cassandra).
I am also having problems indexing Cassandra with Elasticsearch. My problem is basically the node configuration.
Running
$ nodetool status
you can see the Host ID, and then, running
$ curl -XGET http://localhost:9200/_cluster/state/?pretty=true
you can check that one of the entries under nodes: has the same name as the Host ID.
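A quick way to put the two outputs side by side (a sketch assuming both services run on localhost; the grep context size is arbitrary):

$ nodetool status    # note the Host ID column
$ curl -s -XGET 'http://localhost:9200/_cluster/state/?pretty=true' | grep -A 4 '"nodes"'

If the node configuration is correct, one of the node entries printed by the second command should carry the same name as the Host ID from the first.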