Applying rack awareness to a PySpark Structured Streaming job running on Kubernetes and reading from AWS MSK
I have a PySpark Structured Streaming application with the following setup:
- PySpark - version 3.0.1, running on AWS EKS using the Spark Operator.
- Kafka - running on AWS MSK, with the cluster running Apache Kafka version 2.8.1 and replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector set in the cluster configuration (i.e. rack awareness is enabled on the broker side).
The flow:
The application reads from Kafka, performs batch processing at 5-minute intervals, and writes back to Kafka. Both my MSK cluster and the ASG running the instances that host my Spark executors are spread across the same AZs.
I want to leverage the rack awareness mechanism so that the Spark executors read from the closest replica.
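For context, the pipeline looks roughly like the sketch below; the bootstrap servers, topics, and checkpoint path are placeholders, and the actual transformations are elided:

```python
# Rough shape of the current job; anything marked "placeholder" is not real.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("msk-structured-streaming").getOrCreate()

source = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "b-1.my-msk:9092")  # placeholder
    .option("subscribe", "input-topic")                    # placeholder
    .load()
)

# ... the real batch transformations go here; the Kafka sink needs a `value` column ...
transformed = source.selectExpr("CAST(value AS STRING) AS value")

query = (
    transformed.writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "b-1.my-msk:9092")  # placeholder
    .option("topic", "output-topic")                       # placeholder
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/")  # placeholder
    .trigger(processingTime="5 minutes")  # the 5-minute micro-batch cadence
    .start()
)
```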
I wish to do something like the following:
- When spawning new executors on new pods, extract the broker.rack value corresponding to that pod's AZ.
- Inject that broker.rack as an environment variable and initialize the Spark Kafka consumer with a client.rack matching that broker.rack parameter (a rough sketch follows below).
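Roughly what I have in mind is the sketch below. The IMDS endpoint and the kafka.client.rack option name are standard, but the current_az_id helper is hypothetical, and the assumption that MSK sets each broker's broker.rack to its AZ ID (e.g. use1-az1) is my understanding, not something I've verified:

```python
# A minimal sketch, assuming the executor pods can reach the EC2 instance
# metadata service (IMDSv2) and that MSK sets broker.rack to the AZ ID
# (e.g. "use1-az1"). current_az_id is a hypothetical helper, not an existing API.
import urllib.request

from pyspark.sql import SparkSession


def current_az_id() -> str:
    """Return the AZ ID of the EC2 instance this pod is scheduled on."""
    token_req = urllib.request.Request(
        "http://169.254.169.254/latest/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    token = urllib.request.urlopen(token_req).read().decode()
    az_req = urllib.request.Request(
        "http://169.254.169.254/latest/meta-data/placement/availability-zone-id",
        headers={"X-aws-ec2-metadata-token": token},
    )
    return urllib.request.urlopen(az_req).read().decode()


spark = SparkSession.builder.getOrCreate()

source = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "b-1.my-msk:9092")  # placeholder
    .option("subscribe", "input-topic")                    # placeholder
    # Consumer properties take the "kafka." prefix, so client.rack is passed
    # as "kafka.client.rack". The snag: this is evaluated once, on the driver,
    # so every executor's consumer inherits the driver's AZ instead of its own.
    .option("kafka.client.rack", current_az_id())
    .load()
)
```

The last comment in the sketch is exactly my problem: the source options are resolved on the driver and shipped unchanged to every executor's consumer, so I don't see a supported hook for injecting a per-pod client.rack. Hence the idea of an environment variable set per pod, but I don't know how to make the Kafka consumers that Spark creates on the executors pick it up.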
Is this possible? Or is there another solution?