Apache Flink StreamingFileSink makes multiple HEAD requests when writing to S3, which causes rate limiting

Posted 2025-01-14 15:42:15


I have an Apache Flink application deployed on Kinesis Data Analytics.

This application reads from Kafka and writes to S3. The S3 key structure it writes to is computed using a BucketAssigner. A stripped-down version of the BucketAssigner is here.
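(The linked assigner isn't reproduced in this post; as a rough sketch of what such an assigner looks like, with MyEvent and its getters standing in for whatever fields the real code uses, a BucketAssigner that produces a nested path like the one below might be:)

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

public class NestedBucketAssigner implements BucketAssigner<MyEvent, String> {

  @Override
  public String getBucketId(MyEvent element, BucketAssigner.Context context) {
    // Produces e.g. "folder1/folder2/folder3", resolved against the sink's base path.
    return element.getFolder1() + "/" + element.getFolder2() + "/" + element.getFolder3();
  }

  @Override
  public SimpleVersionedSerializer<String> getSerializer() {
    return SimpleVersionedStringSerializer.INSTANCE;
  }
}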

The problem I have is, let us say we have to write to this directory structure: s3://myBucket/folder1/folder2/folder3/myFile.json

Before making the PUT request, it makes the following HEAD requests:

  • HEAD /folder1
  • HEAD /folder1/folder2
  • HEAD /folder1/folder2/folder3/

And then it makes the PUT request.

It does this for each and every request, which is causing S3 rate limiting and thereby backpressure in my Flink application.

I found that someone had a similar issue with BucketingSink: https://lists.apache.org/thread/rbp2gdbxwdrk7zmvwhd2bw56mlwokpzz

The solution mentioned there was to switch to StreamingFileSink, which is what I am doing.

Any ideas on how to fix this in StreamingFileSink?

My SinkConfig is as follows:

StreamingFileSink
  .forRowFormat(new Path(s3Bucket), new JsonEncoder<>())
  .withBucketAssigner(bucketAssigner)
  .withRollingPolicy(DefaultRollingPolicy.builder()
                .withRolloverInterval(60000)   // roll part files every 60 seconds
                .build())
  .build()

The JsonEncoder takes the object, converts it to JSON, and writes out the bytes like this.
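(The linked encoder isn't shown either; a typical row-format Encoder that serializes each record as one JSON line with Jackson would look roughly like this sketch, not necessarily the exact code from the post:)

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.Encoder;

import java.io.IOException;
import java.io.OutputStream;

public class JsonEncoder<T> implements Encoder<T> {

  // Encoder instances must be serializable, so the ObjectMapper is created lazily on the task.
  private transient ObjectMapper mapper;

  @Override
  public void encode(T element, OutputStream stream) throws IOException {
    if (mapper == null) {
      mapper = new ObjectMapper();
    }
    stream.write(mapper.writeValueAsBytes(element));
    stream.write('\n');
  }
}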

I have described more details about how the whole pipeline works in this question, if that helps in any way: Heavy back pressure and huge checkpoint size


Comments (1)

萧瑟寒风 2025-01-21 15:42:15


The Hadoop S3 file system tries to imitate a filesystem on top of S3. This means that:

  • before writing a key it checks if the "parent directory" exists by checking for a key with the prefix up to the last "/"
  • it creates empty marker files to mark the existence of such a parent directory
  • all these "existence" requests are S3 HEAD requests which are both expensive and start to violate consistent read-after-create visibility

As a result, the Hadoop S3 file system has very high "create file" latency and it hits request rate limits very quickly (HEAD requests have very low request rate limits on S3). As a consequence, it's best to find ways to write to fewer distinct files.
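One way to act on that advice with the config from the question is to roll part files less aggressively, so each bucket accumulates fewer, larger files; a hedged variant of the builder above (the values are arbitrary examples, not recommendations):

DefaultRollingPolicy.builder()
  .withRolloverInterval(15 * 60 * 1000L)    // roll at most every 15 minutes instead of every minute
  .withMaxPartSize(128 * 1024 * 1024L)      // ...or once a part file reaches 128 MB
  .build()

Coarsening the BucketAssigner so that fewer distinct nested prefixes are written per interval works toward the same goal.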

You might also explore using entropy injection. Entropy injection is happening at the file system level, so it should work with the FileSink. Except I'm not sure how it will interact with the partitioning/bucketing being done by the sink, so you may or may not find it useable in practice. If you try it, please report back!
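For reference, and with the same caveat that its interaction with the sink's bucketing is unverified here, entropy injection is switched on through the Flink configuration, and the chosen marker then appears in the base path handed to the sink (the path below is just an example):

// In flink-conf.yaml (assuming your deployment lets you set these options):
//   s3.entropy.key: _entropy_
//   s3.entropy.length: 4
// Flink's S3 filesystems replace the marker with random characters on paths
// where entropy injection applies, and strip it from the path otherwise.
Path basePath = new Path("s3://myBucket/_entropy_/output");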
