Is there an optimal way to write many small files with PySpark?

I have a job that requires writing a single JSON file to S3 for each row in a Spark DataFrame (each file then gets picked up by another process).

from pyspark.sql.functions import col

# one output directory per id value; partitionBy expects column names as strings
df.repartition(col("id")).write.mode("overwrite").partitionBy("id").json(
    "s3://bucket/path/to/file"
)

These datasets often consist of 100k rows (sometimes 1M+) and take a very long time to write. I understand that a large number of small files is not great for read performance, but is this also the case for writes? Or is there something that can be done with partitioning to speed things up?
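For illustration only (not part of the original question): one way this per-row requirement is sometimes met is to skip the DataFrame writer and upload each row from the executors, so the repartition count directly controls how many uploads run in parallel. A minimal sketch, assuming boto3 is available on the executors; the bucket name, key pattern, and partition count are placeholder assumptions:

import json

import boto3

def upload_partition(rows):
    # create the client on the executor; boto3 clients are not picklable
    s3 = boto3.client("s3")
    for row in rows:
        body = json.dumps(row.asDict()).encode("utf-8")
        # hypothetical key layout: one object per row, named by the row's id
        s3.put_object(Bucket="bucket", Key=f"path/to/file/{row['id']}.json", Body=body)

# more partitions -> more concurrent uploads; 200 is an arbitrary example value
df.repartition(200).foreachPartition(upload_partition)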

Comments (1)

败给现实 2025-02-15 20:25:04

Please don't do this, you will only suffer pain. S3 was designed to be cheap long-term storage optimized for large files. It was designed so that the 'prefix' (directory path) leads to a bucket that serves the files. If you want to optimize reads and writes, you want several buckets to write to at the same time. This means you want to rearrange the directory path (prefix) so the part with the most variation comes first, increasing the number of buckets you write to.

Example of multiple files being written to the same bucket:

  s3://mydrive/mystuff/2020-12-31
  s3://mydrive/mystuff/2020-12-30
  s3://mydrive/mystuff/2020-12-29

This is because they all share the same bucket prefix --> s3://mydrive/mystuff/
What if instead you flipped the part that changes? Now different buckets are used, because you are writing to different buckets (the prefix is different):

  s3://2020-12-31/mydrive/mystuff/
  s3://2020-12-30/mydrive/mystuff/
  s3://2020-12-29/mydrive/mystuff/

This change will help with read/write speed because different buckets are used. It does not solve the problem that S3 doesn't actually use directories to direct you to files. As I said, a prefix is really just a pointer to the bucket; S3 then searches against all of the objects you have written to find the one that exists in your bucket. This is why tons of small files make things worse: the lookup time grows the more files you write. Because this lookup is expensive, it is much faster to write larger files and keep the lookup cost to a minimum.
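To make the 'flip the part that changes' suggestion concrete, a small sketch of building object keys with the high-variation component at the front of the prefix; the helper name and key layout are illustrative assumptions, not code from the answer above:

# Sketch: put the component that varies (here a date) at the front of the key,
# so reads and writes spread across many prefixes instead of sharing one.
def make_key(date_str: str, filename: str) -> str:
    return f"{date_str}/mydrive/mystuff/{filename}"

print(make_key("2020-12-31", "part-00000.json"))  # 2020-12-31/mydrive/mystuff/part-00000.json
print(make_key("2020-12-30", "part-00000.json"))  # 2020-12-30/mydrive/mystuff/part-00000.json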
