Is there a way to replace old messages with the latest message when consuming from Kafka (to avoid duplicates in the final df)?

Posted on 2025-01-11 13:10:01

I am consuming data from a Kafka topic and, since the data arrives in real time, we see repeated elements. How can I actually replace the old message with the latest one?

I am using the following code to consume from the topic:

from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructField, StructType, StringType

schema = StructType(
    [
        StructField("Id", StringType(), True),
        StructField("cTime", StringType(), True),
        StructField("latestTime", StringType(), False),
        StructField("service", StringType(), True),
    ]
)

topic = "topic1"
bootstrap_servers = "mrdc.it.com:9093,mrdc.it.com:9093,mrdc.it.com:9093"

options = {
    "kafka.sasl.jaas.config": 'org.apache.kafka.common.security.plain.PlainLoginModule required username="xxxxx.aud.com" password="xxxxxxxx";',
    "kafka.ssl.ca.location": "/tmp/cert.crt",
    "kafka.sasl.mechanism": "PLAIN",
    "kafka.security.protocol": "SASL_SSL",
    "kafka.bootstrap.servers": bootstrap_servers,
    "failOnDataLoss": "false",
    "subscribe": topic,
    "startingOffsets": "latest",
    "enable.auto.commit": "false",
    "auto.offset.reset": "false",
    "enable.partition.eof": "true",
    "key.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
    "value.deserializer": "org.apache.kafka.common.serialization.StringDeserializer",
}
kafka_df = spark.readStream.format("kafka").options(**options).load()

# Parse the Kafka value bytes into the typed struct defined above
kafka_mobile_apps_df = kafka_df.select(from_json(col("value").cast("string"), schema).alias("apps"))

# avertack_kafka_eventhub_connections / kafka_config are internal helpers not shown here
df = avertack_kafka_eventhub_connections(source="KAFKA", kafka_config=kafka_config)

sql_features = ["apps.Id",
                "apps.cTime",
                "apps.latestTime",
                "apps.service"
               ]

kafka_df_features = df.selectExpr(sql_features)
display(kafka_df_features)

The output is as shown below:

Id              cTime                   latestTime              service
3178    2022-03-03T20:39:52.889Z    2022-03-03T20:39:58.601Z    mobile
3178    2022-03-03T20:39:52.889Z    2022-03-03T20:39:59.012Z    mobile
3240    2022-03-03T20:39:59.140Z    2022-03-03T20:39:59.220Z    mobile
3246    2022-03-03T20:40:00.615Z    2022-03-03T20:40:00.648Z    mobile
.
.
.

How can we overwrite row 1 with row 2, using ["Id"] as the key and comparing the "latestTime" column, so that only the message with the latest time is kept?

Is there a way to do this in real time? If not, how can we at least run it once an hour to replace the old messages with the new ones?

Final output:

Id              cTime                   latestTime              service
3178    2022-03-03T20:39:52.889Z    2022-03-03T20:39:59.012Z    mobile
3240    2022-03-03T20:39:59.140Z    2022-03-03T20:39:59.220Z    mobile
3246    2022-03-03T20:40:00.615Z    2022-03-03T20:40:00.648Z    mobile
.
.
.
.

Comments (1)

提笔书几行 2025-01-18 13:10:01

Spark is intended more for batch/micro-batch processing; it gets a collection of records, which you can order by time and take the "latest" from, or you can group by id and do the same...
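
As a minimal sketch of that group-by-id / take-the-latest step, assuming the kafka_df_features frame from the question (Id and latestTime as in the sample output), something like this works on a static/batch DataFrame or inside foreachBatch; plain window functions are not supported directly on a streaming DataFrame:

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Rank the rows of each Id by latestTime (newest first) and keep only the top one.
w = Window.partitionBy("Id").orderBy(col("latestTime").desc())

deduped = (kafka_df_features
           .withColumn("rn", row_number().over(w))
           .filter(col("rn") == 1)
           .drop("rn"))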

But you will have to combine this with some database / persistent storage to create a view of "most recent, by id". I've seen this done with HBase, Couchbase, MongoDB, etc., all of which have some level of Spark integration, if that is a requirement.
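
As one illustration of that pattern (using Delta Lake as the persistent store rather than the systems named above, since display() suggests Databricks; the table and checkpoint paths are placeholders and the Delta table is assumed to exist), a hedged foreachBatch sketch that upserts the newest row per Id each micro-batch, and can be triggered hourly as asked, could look like:

from delta.tables import DeltaTable
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

def upsert_latest(batch_df, batch_id):
    # Keep only the newest row per Id inside this micro-batch...
    w = Window.partitionBy("Id").orderBy(col("latestTime").desc())
    latest = (batch_df.withColumn("rn", row_number().over(w))
                      .filter(col("rn") == 1)
                      .drop("rn"))
    # ...then MERGE by Id into the target Delta table, overwriting only when newer.
    target = DeltaTable.forPath(spark, "/mnt/tables/apps_latest")  # placeholder path
    (target.alias("t")
           .merge(latest.alias("s"), "t.Id = s.Id")
           .whenMatchedUpdateAll(condition="s.latestTime > t.latestTime")
           .whenNotMatchedInsertAll()
           .execute())

(kafka_df_features.writeStream
    .trigger(processingTime="1 hour")  # or drop this for continuous micro-batches
    .foreachBatch(upsert_latest)
    .option("checkpointLocation", "/mnt/checkpoints/apps_latest")  # placeholder path
    .start())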

Out of the box, I don't think Spark provides this easily (there is a RocksDB state store you can look at).
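
If you do look at that route, the RocksDB state store is switched on through a session config (Spark 3.2+); a hedged sketch, noting that Spark's built-in streaming de-duplication keeps the first row it sees per key rather than the latest:

# Use RocksDB instead of the default in-memory state store (Spark 3.2+).
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)

# Stateful de-duplication by key: avoids repeats, but keeps the *first* arrival per Id,
# and without a watermark the state grows without bound.
deduped_stream = kafka_df_features.dropDuplicates(["Id"])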


Alternatively, and natively to Kafka, there is Kafka Streams, which provides exactly what you're looking for, although it is done in Java.

If you must use Python, ksqlDB can be set up; its API can be used from any other language and lets you define stream processing more easily as SQL statements.


Kafka Connect is another alternative: for example, if you have a standard relational database and the Id in your data is the primary key of a table, matching keys will perform UPDATE queries and overwrite existing records, while new Ids will be inserted.
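
As a hedged illustration of that last option (assuming the Confluent JDBC sink connector, a reachable Connect REST endpoint, and topic values the sink can deserialize; all hosts, credentials and the connector name are placeholders), registering such an upsert sink from Python could look like:

import json
import requests  # assumes the requests package is available

connector = {
    "name": "topic1-jdbc-sink",  # placeholder connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "topic1",
        "connection.url": "jdbc:postgresql://db-host:5432/apps",  # placeholder
        "connection.user": "user",          # placeholder
        "connection.password": "password",  # placeholder
        "insert.mode": "upsert",            # UPDATE on a matching key, INSERT otherwise
        "pk.mode": "record_value",
        "pk.fields": "Id",                  # the key to match on
        "auto.create": "true",
    },
}

resp = requests.post(
    "http://connect-host:8083/connectors",  # placeholder Connect REST endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()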
