Pub/Sub pull request count drops drastically on GCP kubernetes pods
I have ~5M messages (~7 GB in total) in the backlog of my GCP Pub/Sub subscription and want to pull as many of them as possible. I am using synchronous pull with the settings below and wait 3 minutes to pile up messages before sending them to another DB.
defaultSettings := &pubsub.ReceiveSettings{
    MaxExtension:           10 * time.Minute,
    MaxOutstandingMessages: 100000,
    MaxOutstandingBytes:    128e6, // 128 MB
    NumGoroutines:          1,
    Synchronous:            true,
}
The problem is that with around 5 pods on my kubernetes cluster, the pods are able to pull nearly ~90k messages in almost every round (3-minute period). However, when I increase the number of pods to 20, each pod retrieves ~90k messages in the first or second round, but after a while the pull request count drops drastically and each pod receives only ~1k-5k messages per round. I have investigated the Go library's synchronous pull mechanism and know that without successfully acking messages you cannot request new ones, so the pull request count may drop to avoid exceeding MaxOutstandingMessages.
However, I am scaling my pods down to zero and starting fresh pods while there are still millions of unacked messages in my subscription, and the fresh pods still get a very low number of messages in 3 minutes, regardless of whether there are 5 or 20 of them. After around 20-30 minutes they each receive ~90k messages again, and then after a while drop back to very low levels (checked from the metrics page). Another interesting thing is that while my fresh pods receive very few messages, my local computer connected to the same subscription gets ~90k messages in every round.
I have read the Pub/Sub quotas and limits page; the bandwidth quotas are extremely high (240,000,000 kB per minute (4 GB/s) in large regions). I have tried a lot of things but cannot understand why the pull request count drops massively when I start fresh pods. Is there some connection or bandwidth limitation for kubernetes cluster nodes on the GCP or Pub/Sub side? Receiving messages in high volume is critical for my task.
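For reference, this is roughly how those settings are wired into each pod's 3-minute pull round. It is a minimal sketch assuming the cloud.google.com/go/pubsub client; the project ID, subscription ID, and the batch write to the other DB are placeholders.

package main

import (
    "context"
    "log"
    "sync/atomic"
    "time"

    "cloud.google.com/go/pubsub"
)

func main() {
    ctx := context.Background()
    client, err := pubsub.NewClient(ctx, "my-project") // placeholder project ID
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    sub := client.Subscription("my-subscription") // placeholder subscription ID
    sub.ReceiveSettings = pubsub.ReceiveSettings{
        MaxExtension:           10 * time.Minute,
        MaxOutstandingMessages: 100000,
        MaxOutstandingBytes:    128e6, // 128 MB
        NumGoroutines:          1,
        Synchronous:            true,
    }

    // One 3-minute round: pile up messages, then flush the batch to the other DB.
    roundCtx, cancel := context.WithTimeout(ctx, 3*time.Minute)
    defer cancel()

    var received int64
    err = sub.Receive(roundCtx, func(_ context.Context, m *pubsub.Message) {
        atomic.AddInt64(&received, 1)
        // ... buffer m.Data for the batch write to the other DB ...
        m.Ack()
    })
    if err != nil {
        log.Fatalf("Receive: %v", err)
    }
    log.Printf("round finished, received %d messages", received)
}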
1 Answer
If you are using synchronous pull, I suggest using StreamingPull for Pub/Sub usage at your scale. It is expected that, for a high-throughput scenario with synchronous pull, there should always be many idle requests.
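As a rough illustration, the switch to streaming pull with the same Go client could look like the sketch below; the subscription handle and the tuning values are assumptions, not a definitive configuration. In this client, leaving Synchronous at its default (false) makes Receive use the StreamingPull API.

package consumer

import (
    "context"
    "time"

    "cloud.google.com/go/pubsub"
)

// receiveStreaming consumes via streaming pull instead of synchronous pull.
func receiveStreaming(ctx context.Context, sub *pubsub.Subscription) error {
    sub.ReceiveSettings = pubsub.ReceiveSettings{
        MaxExtension:           10 * time.Minute,
        MaxOutstandingMessages: 100000,
        MaxOutstandingBytes:    128e6,
        Synchronous:            false, // default; Receive uses StreamingPull
        // Several streams per pod keep the pod busy and generate real CPU load,
        // which also plays better with CPU-based autoscaling.
        NumGoroutines: 10,
    }
    return sub.Receive(ctx, func(_ context.Context, m *pubsub.Message) {
        // Callbacks run concurrently, bounded by MaxOutstandingMessages/Bytes.
        // ... write to the target DB ...
        m.Ack()
    })
}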
A synchronous pull request establishes a connection to one specific server (process). A high-throughput topic is handled by many servers, but incoming messages go to only a few of them, roughly 3 to 5. Those servers should already have an idle process connected so that they can forward messages quickly.
This conflicts with CPU-based scaling, because idle connections don't generate CPU load. To make CPU-based scaling work, each pod should run well over 10 threads.
Also, you can use a Horizontal Pod Autoscaler (HPA) configured for the GKE pods that consume from Pub/Sub; with the HPA, you can configure scaling on CPU usage. My last recommendation would be to consider Dataflow for your workload, consuming from Pub/Sub.