Flink missing data with windowed processor (event-time windows) and Kafka source

Published 2025-01-22 20:48:01

We have a streaming job with 20 separate pipelines, each pipeline having one or many Kafka topic sources, with some pipelines using a windowed processor and others a non-windowed processor.

We are noticing data loss for the windowed-processor pipelines when the job goes down and takes some time to recover, or when the job needs to be restarted.

  • I have set a UID for all of the operators, and I can see in the logs that offsets are being restored from the savepoint for the Kafka consumer operator.

  • We are using BoundedOutOfOrdernessTimestampExtractor to assign watermarks based on event time.

import java.io.Serializable;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import lombok.extern.slf4j.Slf4j;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

@Slf4j // assumed source of the `log` field used below
public class KafkaEventTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<Event> implements Serializable {

    public KafkaEventTimestampExtractor(Time maxOutOfOrderness) {
        super(maxOutOfOrderness);
    }

    @Override
    public long extractTimestamp(Event element) {
        try {
            log.info("event to be processed, event:{}", new ObjectMapper().writeValueAsString(element));
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
        // Parse the event's timestamp field; values with fewer than 13 digits are
        // treated as seconds and normalized to epoch milliseconds.
        long ts = (long) Double.parseDouble(element.getTs());
        ts = Long.toString(ts).length() < 13 ? ts * 1000 : ts;
        return ts;
    }
}

The pipeline config looks something like this.

  • NON-WINDOWED
SourceUtil
  .getEventDataStream(env, kafkaSourceSet)
  .process(new S3EventProcessor()).uid(“…..**)
  .addSink();
  • WINDOWED
SourceUtil
  .getEventDataStream(env, kafkaSourceSet)
  .assignTimestampsAndWatermarks(
    new KafkaEventTimestampExtractor(Time.seconds(4)))
  .windowAll(TumblingEventTimeWindows.of(
    Time.milliseconds(kafkaSourceSet.bufferWindowSize)))
  .process(new S3EventProcessor()).uid(“…..**)
  .addSink();
  • Let's say the job is down for 30 minutes. In that case, the pipelines where we do not use a window processor do not miss any data, but partial data is missed from the windowed processor for those 30 minutes.

  • When we increase the out-of-order event delay (the 4-second bound in the timestamp extractor) to 30 minutes, events are no longer missed as long as the application is back up within 30 minutes. That gets us nowhere near a solution, though: a delay of more than 1 minute is infeasible for us, and there would be too many live windows, which would mean a huge infra change for us.


Comments (2)

倾城泪 2025-01-29 20:48:01

The only scenario I can imagine that might explain this is if the event timestamps are affected by the outage. Then a 30-minute outage would cause a 30-minute gap in the timestamps, and with out-of-order ingestion, a 4-second bounded-out-of-orderness strategy will yield some late events that will be dropped by the window.
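
(Not from the answer above, just an illustration of the behaviour it describes.) If late records are indeed the cause, one way to make them visible instead of letting the window drop them silently is a late-data side output plus some allowed lateness. The fragment below follows the style of the question's snippets, reuses SourceUtil, kafkaSourceSet, KafkaEventTimestampExtractor and S3EventProcessor from the question, and assumes purely for illustration that the window processor emits Event elements:

OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

SingleOutputStreamOperator<Event> windowed = SourceUtil
    .getEventDataStream(env, kafkaSourceSet)
    .assignTimestampsAndWatermarks(new KafkaEventTimestampExtractor(Time.seconds(4)))
    .windowAll(TumblingEventTimeWindows.of(Time.milliseconds(kafkaSourceSet.bufferWindowSize)))
    .allowedLateness(Time.minutes(1))   // keep window state around a little longer for stragglers
    .sideOutputLateData(lateTag)        // records later than watermark + lateness go here instead of being dropped
    .process(new S3EventProcessor());

// Late records can then be inspected or sunk separately for reconciliation.
windowed.getSideOutput(lateTag).print();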

黑凤梨 2025-01-29 20:48:01

This was happening due to a mistake in my pipeline: instead of attaching the timestamp assigner to the FlinkKafkaConsumer, it was added to the data stream generated from the FlinkKafkaConsumer.
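
A minimal sketch of what that change could look like, assuming the legacy FlinkKafkaConsumer API that the question's BoundedOutOfOrdernessTimestampExtractor targets; the topic name, EventDeserializationSchema and kafkaProps below are placeholders, not taken from the original pipeline:

FlinkKafkaConsumer<Event> consumer =
    new FlinkKafkaConsumer<>("events-topic", new EventDeserializationSchema(), kafkaProps);

// Attaching the assigner to the consumer itself means watermarks are generated
// per Kafka partition inside the source, not on the already-merged stream.
consumer.assignTimestampsAndWatermarks(new KafkaEventTimestampExtractor(Time.seconds(4)));

env.addSource(consumer)
    .windowAll(TumblingEventTimeWindows.of(Time.milliseconds(kafkaSourceSet.bufferWindowSize)))
    .process(new S3EventProcessor());
    // .uid(...) and sink as in the original pipeline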

This change has fixed the issue on my end for automatic recovery, but in case of a manual restart after any changes to the pipeline, some data is still missed for the last window that was open when the job stopped.

Note: we are using checkpoints for manual recovery.
As per the docs, checkpoints are ideal for automatic recovery in case of job failures.
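
One caveat worth noting (my assumption, not stated in the answer): a checkpoint is only available for a manual restart if it is retained when the job is cancelled, e.g. via externalized checkpoints. A sketch against the older CheckpointConfig API, with an illustrative interval:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

env.enableCheckpointing(60_000); // illustrative interval: checkpoint every 60 s
env.getCheckpointConfig().enableExternalizedCheckpoints(
    CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

// A retained checkpoint directory (.../chk-<n>) can then be passed to
// `flink run -s <path>` when restarting manually, just like a savepoint path.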

Any note on this would help: do we need to create a savepoint whenever we make changes to the pipeline and restart it manually, or can we make a complete recovery from the checkpoint alone?

Our only concern with using a savepoint is the reprocessing of the same events that might happen, which is not ideal for us in a few cases.
