Solandra sharding: internal thoughts
Just got started on Solandra and was trying to understand the second-level details of Solandra sharding.

AFAIK Solandra creates the configured number of shards (the "solandra.shards.at.once" property), where each shard grows up to the size of "solandra.maximum.docs.per.shard". At the next level it starts creating slots inside each shard, the number of which is "solandra.maximum.docs.per.shard" / "solandra.index.id.reserve.size".

What I understood from the data model of the SchemaInfo CF is that inside a particular shard there are slots owned by different physical nodes, and there is a race between nodes to claim these slots.
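To make the slot arithmetic concrete, here is a minimal sketch. The property values used are illustrative round numbers I chose for the example, not confirmed Solandra defaults:

```python
# Hypothetical example values for the two Solandra properties;
# real deployments set these in solandra.properties.
MAX_DOCS_PER_SHARD = 1_048_576   # solandra.maximum.docs.per.shard
RESERVE_SIZE = 16_384            # solandra.index.id.reserve.size

# Each node reserves document-id ranges ("slots") of RESERVE_SIZE ids,
# so a shard is divided into this many slots that nodes race to claim:
slots_per_shard = MAX_DOCS_PER_SHARD // RESERVE_SIZE
print(slots_per_shard)  # 64
```

A larger reserve size means fewer, bigger slots per shard, and therefore fewer reservation round-trips between nodes.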
My questions are:
1. Does this mean that if I request a write on a particular Solr node, e.g.
   .....solandra/abc/dataimport?command=full-import
   the request gets distributed to all possible nodes? Is this a distributed write? Because until that happens, how would other nodes be competing for slots inside a particular shard? Ideally the code for writing a doc or a set of docs would be executed on a single physical JVM. By sharding we try to write some docs on a single physical node, but if writing is based on slots owned by different physical nodes, what did we actually achieve, since we again need to fetch results from different nodes? I understand that write throughput is maximized.

2. Can we look into tuning these numbers: "solandra.maximum.docs.per.shard", "solandra.index.id.reserve.size", "solandra.shards.at.once"?

3. If I have just one shard and a replication factor of 5 in a single-DC, 6-node setup, I saw that the endpoints of this shard contain 5 endpoints, as per the replication factor. But what happens to the 6th one? I saw through nodetool that the remaining 6th node doesn't really get any data. If I increase the replication factor to 6 while keeping the cluster on, will this solve the problem (together with running repair, etc.), or is there a better way?
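The single-shard observation in point 3 follows from how Cassandra's SimpleStrategy places replicas: the node owning the row's token plus the next RF-1 nodes clockwise around the ring. A toy sketch (node names and tokens are hypothetical, not from any real cluster):

```python
# SimpleStrategy-style placement: a row is replicated on the node owning
# its token and the next rf-1 nodes clockwise around the ring.
def replicas(ring, key_token, rf):
    """Return the rf nodes that hold replicas of the row at key_token."""
    nodes = sorted(ring)  # list of (token, node_name) pairs
    # first node whose token is >= the key's token; wrap around if none
    start = next((i for i, (t, _) in enumerate(nodes) if t >= key_token), 0)
    return [nodes[(start + i) % len(nodes)][1] for i in range(rf)]

ring = [(i * 100, f"node{i+1}") for i in range(6)]   # 6-node ring
owners = replicas(ring, key_token=42, rf=5)
print(owners)                                        # 5 of the 6 nodes
print(set(n for _, n in ring) - set(owners))         # the node with no data
```

With a single shard (one row key) and RF=5, exactly one of the six nodes is never in the replica set, matching what nodetool shows; at rf=6 every node appears in `owners`.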
Answer (1):
Overall, the shards.at.once param is used to control the parallelism of indexing. The higher that number, the more shards are written to at once. If you set it to one, you will always be writing to only one shard. Normally this should be set to 20% greater than the number of nodes in the cluster, so for a four-node cluster set it to five.

The higher the reserve size, the less coordination between the nodes is needed, so if you know you have lots of documents to write, raise this.

The higher the docs.per.shard, the slower the queries for a given shard will become. In general this should be 1-5M max.
To answer your points:

1. This will only import from one node, but it will index across many shards, depending on shards.at.once. I think the underlying question is: should you write across all nodes? Yes.

2. Yes, see above.

3. If you increase shards.at.once, this will be populated quickly.