Google Cloud Storage "folder" looks like an object (generated by Spark)

Posted on 2025-01-31 19:55:51 · 2258 characters · 3 views · 0 comments

TL;DR - I have a "folder" (I'm aware that technically there are no folders) that looks like an actual object, with zero size.

What does it look like?

See the output of gsutil's long listing (some names redacted and changed):

╰─$ gsutil ls -l "gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/**"
         0  2022-05-24T00:51:37Z  gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/
    450940  2022-05-24T00:51:38Z  gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/mainId=123/part-00041-....c000.csv.gz
    226889  2022-05-24T00:51:37Z  gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/mainId=456/part-00012-....c000.csv.gz
TOTAL: 3 objects, 677829 bytes (661.94 KiB)

The folder itself should not be listed as an object, and this is not what happens when I list other folders.
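To check whether other prefixes have the same kind of entry, the listing can be filtered for zero-byte objects whose name ends in "/". This is only a minimal sketch: find_placeholder_objects is a hypothetical helper name, and it assumes the three-column `gsutil ls -l` layout shown above and object names without spaces.

```shell
# Hypothetical helper (not part of gsutil): reads `gsutil ls -l` output on
# stdin and prints the names of zero-byte objects whose name ends in "/".
# Assumes three whitespace-separated columns (size, timestamp, URL), so
# object names containing spaces are not handled.
find_placeholder_objects() {
  awk 'NF == 3 && $1 == 0 && $3 ~ /^gs:\/\// && $3 ~ /\/$/ { print $3 }'
}

# Example usage (bucket path elided as in the question):
#   gsutil ls -l "gs://.../hive-warehouse/some.db/test/**" | find_placeholder_objects
```

Regular data files and the TOTAL summary line are skipped, so anything printed is a candidate "folder" object like the one above.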

Running gsutil's more detailed listing (-L) on that path produces the following:

╰─$ gsutil ls -L gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/
gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/:
    Creation time:          Tue, 24 May 2022 00:51:37 GMT
    Update time:            Tue, 24 May 2022 00:51:37 GMT
    Storage class:          STANDARD
    Content-Length:         0
    Content-Type:           application/octet-stream
    Hash (crc32c):          AAAAAA==
    Hash (md5):             2B2M2Y8AsgTpgAmY7PhCfg==
    ETag:                   CIzupN/aaBBcEAE=
    Generation:             1653351117573132
    Metageneration:         1
    ACL:                    []
                                 gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/mainId=123/
                                 gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/mainId=456/
TOTAL: 1 objects, 0 bytes (0 B)

What generated that folder

We are using Google's managed Spark cluster (Dataproc) together with the managed Hive metastore that comes with it.
The following code created the test table:

      import org.apache.spark.sql.SaveMode

      // someDf is a DataFrame
      someDf.write.mode(SaveMode.Overwrite)
        .format("csv")
        .partitionBy("partDate", "mainType", "mainId")
        .option("compression", "gzip")
        .option("header", value = true)
        .saveAsTable("test")

Attempted conclusion

From all of the above, it seems that Spark has created an empty object with the same name as the mainType partition (and only for that partition).
I'm not sure whether this means anything, or what else to make of it.
I would love to hear from some experts (either Spark or GCP).
