Kubernetes doesn't seem to update its internal IP table after a node restart
We're currently experiencing an issue with our GCP Kubernetes cluster: it forwards client requests to pods that have been assigned IPs that other pods within the cluster /previously/ held. The way we can see this is by using the following query in Logs Explorer:
resource.type="http_load_balancer"
httpRequest.requestMethod="GET"
httpRequest.status=404
Snippet from one of the logs:
httpRequest: {
latency: "0.017669s"
referer: "https://asdf.com/"
remoteIp: "5.57.50.217"
requestMethod: "GET"
requestSize: "34"
requestUrl: "https://[asdf.com]/api/service2/[...]"
responseSize: "13"
serverIp: "10.19.160.16"
status: 404
userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
...where the requestUrl property indicates the incoming URL to the load balancer.
Then I search for the IP 10.19.160.16 to find out which pod the IP is assigned to:
c:\>kubectl get pods -o wide | findstr 10.19.160.16
service1-675bfc4f97-slq6g 1/1 Terminated 0 40h 10.19.160.16 gke-namespace-te-namespace-te-153a9649-p2mg
service2-574d69cf69-c7knp 0/1 Error 0 3d16h 10.19.160.16 gke-namespace-te-namespace-te-153a9649-p2mg
service3-6db4c97784-428pq 1/1 Running 0 16h 10.19.160.16 gke-namespace-te-namespace-te-153a9649-p2mg
So based on the requestUrl, the request should have been sent to service2. Instead, what we see is that it gets sent to service3, because service3 has been given the IP that service2 used to have; in other words, the cluster seems to still think that service2 is holding on to the IP 10.19.160.16. The effect is that service3 returns status code 404 because it doesn't recognize the endpoint.
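The duplicate-IP situation above can also be spotted mechanically rather than by eyeballing the pod list. A minimal sketch, assuming the default column layout of `kubectl get pods -o wide` (IP in the sixth column) and a POSIX shell; the helper name is made up:

```shell
# find_shared_ips: read "kubectl get pods -o wide --no-headers" output on
# stdin and print every pod IP that is claimed by more than one pod.
# Assumes the default column order (IP is field 6).
find_shared_ips() {
  awk '{print $6}' | sort | uniq -d
}

# Typical use (requires a live cluster):
#   kubectl get pods -o wide --no-headers | find_shared_ips
```

Fed the three pod lines shown above, this prints 10.19.160.16 once, since all three pods claim that IP.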
This behavior only stops if we manually delete the pods in a failed state (e.g. Error or Terminated) using the kubectl delete pod ... command.
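As a stopgap, the failed pods can be removed in bulk rather than one by one. A sketch using kubectl's standard field selector (pods shown with Error status have phase Failed; a live cluster and default kubeconfig are assumed):

```shell
# Delete every pod whose phase is Failed, across all namespaces.
# Pods in "Error" status have phase Failed; pods terminated by node
# preemption/shutdown typically end up in this phase as well.
kubectl delete pods --field-selector=status.phase=Failed --all-namespaces
```

This only clears the symptom; the stale endpoints would presumably come back after the next preemption.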
We suspect that this behavior started when we upgraded our cluster to v1.23, which required us to migrate from extensions/v1beta1 to networking.k8s.io/v1, as described in https://cloud.google.com/kubernetes-engine/docs/deprecations/apis-1-22.
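For reference, the shape of that migration for an Ingress: the flat v1beta1 backend stanza becomes a nested service block, and pathType becomes a required field. A hedged sketch; the resource name, path, and port are invented for illustration, not taken from our manifests:

```yaml
apiVersion: networking.k8s.io/v1   # was: extensions/v1beta1
kind: Ingress
metadata:
  name: example-ingress            # hypothetical name
spec:
  rules:
  - http:
      paths:
      - path: /api/service2
        pathType: Prefix           # required in networking.k8s.io/v1
        backend:
          service:                 # was: serviceName / servicePort
            name: service2
            port:
              number: 80
```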
Our test environment uses preemptible VMs, and while we're not 100% sure (but pretty close), it seems that the pods end up in the Error state after a node is preempted.
Why does the cluster still think that a dead pod holds the IP it used to have? Why does the problem go away after deleting the failed pods? Shouldn't they have been cleaned up after a node preemption?
Gari Singh provided the answer in the comment.