Envoy - failover to another control plane instance
In our setup, Envoy consumes dynamic configuration from the control plane via gRPC. The control plane cluster is configured with STRICT_DNS discovery:
- name: cplane
  connect_timeout: 5s
  type: STRICT_DNS
  load_assignment:
    cluster_name: cplane
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: control-plane-fqdn
              port_value: 1234
The control-plane-fqdn DNS record resolves to multiple instances, and Envoy connects to any one of them. The question is: what failover mechanism does Envoy use? From my observations, failover to another instance (after shutting down the one Envoy is connected to) takes anywhere from 5 to 50 seconds. What causes this spread, and is it possible to make the failover time more deterministic?
Comments (1)
Eventually found the answer. It is the exponential backoff mechanism defined here [1]. The parameter RetryMaxDelayMs is not configurable and is hard-coded to 30 seconds. Moreover, when the DNS record has two IPs and one of them becomes unreachable, Envoy tries to reconnect to each IP in round-robin order, with an exponential backoff between attempts. So if the secondary IP is not ready to accept connections immediately, the backoff may add another 30 to 40 seconds before Envoy reconnects.
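To get a feel for why the spread is so wide, here is a rough Python sketch of a jittered exponential backoff combined with round-robin attempts. It is a simulation for illustration only, not Envoy's code. The 30 s cap comes from the answer above; the 500 ms base interval, the uniform jitter, and the names JitteredExponentialBackoff and time_to_reconnect_ms are my own assumptions, and connect timeouts are ignored.

```python
import math
import random


class JitteredExponentialBackoff:
    """Illustrative backoff: each delay is drawn uniformly from [0, ceiling],
    and the ceiling doubles after every attempt until it reaches max_ms."""

    def __init__(self, base_ms, max_ms, rng=None):
        if base_ms <= 0 or max_ms < base_ms:
            raise ValueError("need 0 < base_ms <= max_ms")
        self.base_ms = base_ms
        self.max_ms = max_ms
        self.rng = rng or random.Random()
        self.ceiling_ms = base_ms

    def next_delay_ms(self):
        delay = self.rng.uniform(0, self.ceiling_ms)
        # Double the ceiling for the next attempt, capped at max_ms.
        self.ceiling_ms = min(self.ceiling_ms * 2, self.max_ms)
        return delay


def time_to_reconnect_ms(ready_at_ms, backoff, max_attempts=1000):
    """Simulate round-robin reconnect attempts over several endpoints.

    ready_at_ms[i] is the time at which endpoint i starts accepting
    connections (math.inf for an endpoint that never comes back).
    Returns the elapsed time of the first successful attempt, or None
    if no attempt succeeds within max_attempts.
    """
    elapsed = 0.0
    for attempt in range(max_attempts):
        target = attempt % len(ready_at_ms)  # round-robin over the IPs
        if elapsed >= ready_at_ms[target]:
            return elapsed
        elapsed += backoff.next_delay_ms()
    return None


if __name__ == "__main__":
    # Assumed scenario: first IP is gone for good, second is ready after 5 s.
    samples = [
        time_to_reconnect_ms(
            [math.inf, 5000],
            JitteredExponentialBackoff(500, 30000, random.Random(seed)),
        )
        for seed in range(1000)
    ]
    print(f"min {min(samples) / 1000:.1f} s, max {max(samples) / 1000:.1f} s")
```

Because every delay is random and the ceiling keeps doubling, an attempt that lands just before the healthy instance is ready can be followed by a long wait, which would be consistent with the 5 to 50 second range seen in the question.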