Google Cloud Storage "folder" looks like an object (generated by Spark)
TL;DR - I have a "folder" (I'm aware that technically there are no folders) that looks like an actual object, with zero size.
What does it look like?
See the output of gsutil's long listing (some names redacted and changed):
╰─$ gsutil ls -l "gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/**"
0 2022-05-24T00:51:37Z gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/
450940 2022-05-24T00:51:38Z gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/mainId=123/part-00041-....c000.csv.gz
226889 2022-05-24T00:51:37Z gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/mainId=456/part-00012-....c000.csv.gz
TOTAL: 3 objects, 677829 bytes (661.94 KiB)
The folder itself was not supposed to be listed as an object. This isn't the behavior when listing other folders.
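To check whether other partitions show the same oddity, a listing can be filtered for zero-byte entries whose name ends in "/". Below is a minimal sketch, not a definitive tool: find_placeholders is a hypothetical helper name (it is not part of gsutil), and it assumes the three-column "size, timestamp, URL" layout shown above and object names without spaces.

```shell
#!/bin/sh
# Hypothetical helper (not part of gsutil): reads `gsutil ls -l` output on
# stdin and prints the URL of every zero-byte entry whose name ends in "/".
# Assumes the "size  timestamp  URL" layout and no spaces in object names.
find_placeholders() {
  # The "TOTAL: ..." summary line is skipped because its first field is
  # "TOTAL:" rather than the size 0.
  awk 'NF >= 3 && $1 == "0" && $NF ~ /\/$/ { print $NF }'
}
```

It would be used by piping a recursive listing into it, for example: gsutil ls -l "gs://.../hive-warehouse/some.db/test/**" | find_placeholders. With the listing above, the only line it should print is the mainType=entity/ entry.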
Running gsutil's more detailed listing (ls -L) on the folder itself produces the following:
╰─$ gsutil ls -L gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/
gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/:
Creation time: Tue, 24 May 2022 00:51:37 GMT
Update time: Tue, 24 May 2022 00:51:37 GMT
Storage class: STANDARD
Content-Length: 0
Content-Type: application/octet-stream
Hash (crc32c): AAAAAA==
Hash (md5): 2B2M2Y8AsgTpgAmY7PhCfg==
ETag: CIzupN/aaBBcEAE=
Generation: 1653351117573132
Metageneration: 1
ACL: []
gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/mainId=123/
gs://.../hive-warehouse/some.db/test/partDate=2022-05-24/mainType=entity/mainId=456/
TOTAL: 1 objects, 0 bytes (0 B)
What generated that folder
We are using Google's managed Spark cluster (Dataproc) with the managed Hive metastore that comes with it.
The following code created the test table:
import org.apache.spark.sql.SaveMode

// someDf is a DataFrame
someDf.write.mode(SaveMode.Overwrite)
  .format("csv")
  .partitionBy("partDate", "mainType", "mainId")
  .option("compression", "gzip")
  .option("header", value = true)
  .saveAsTable("test")
Attempted conclusion
From all the above it seems like Spark has created an empty object with the same name as the mainType partition (and only for that partition).
I'm not sure if the above means anything, or what else to make of it.
Would love to hear from some experts (either Spark or GCP).