Nullability lost when reading Parquet in Spark 3.3.0

I read data from a CSV file, and supply a hand-crafted schema:

new StructType(new StructField[] {
    new StructField("id", LongType, false, Metadata.empty()),
    new StructField("foo", IntegerType, false, Metadata.empty()),
    new StructField("bar", DateType, true, Metadata.empty()) });

Printing the schema shows:

root
 |-- id: long (nullable = false)
 |-- foo: integer (nullable = false)
 |-- bar: date (nullable = true)

And writing it to a parquet file using this code ...

df.write().format("parquet").save("data.parquet");

... generates this log message:

INFO : o.a.s.s.e.d.p.ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "long",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "foo",
    "type" : "integer",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "bar",
    "type" : "date",
    "nullable" : true,
    "metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  required int64 id;
  required int32 foo;
  optional int32 bar (DATE);
}

All looks good there.

However, if I then read in that parquet file using this code:

Dataset<Row> read = spark.read().format("parquet").load("data.parquet");

... and print the schema, I get:

root
 |-- id: long (nullable = true)
 |-- foo: integer (nullable = true)
 |-- bar: date (nullable = true)

As can be seen above, all columns have become nullable - the non-nullability specified in the original schema has been lost.
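
A quick programmatic check confirms the same thing (just a sketch):

// Every field of the re-read Dataset now reports nullable = true.
for (StructField f : read.schema().fields()) {
    System.out.println(f.name() + " nullable = " + f.nullable());
}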

Now, if we take a look at some of the debug that is output during the load, it shows that Spark is correctly identifying the nullability. (I've added newlines to make this more readable):

FileMetaData(
  version:1, 
  schema:[SchemaElement(name:spark_schema, num_children:4), 
  SchemaElement(type:INT64, repetition_type:REQUIRED, name:id), 
  SchemaElement(type:INT32, repetition_type:REQUIRED, name:foo), 
  SchemaElement(type:INT32, repetition_type:OPTIONAL, name:bar, converted_type:DATE, logicalType:<LogicalType DATE:DateType()>)], 
  num_rows:7, 
  row_groups:null, 
  key_value_metadata:
      [
        KeyValue(key:org.apache.spark.version, value:3.3.0), 
        KeyValue(
          key:org.apache.spark.sql.parquet.row.metadata, 
          value:{
            "type":"struct",
            "fields":
                [
                  {"name":"id","type":"long","nullable":false,"metadata":{}},
                  {"name":"foo","type":"integer","nullable":false,"metadata":{}},
                  {"name":"bar","type":"date","nullable":true,"metadata":{}}
                ]
          })
        ], 
  created_by:parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94))

The question is then: why (and where) is the non-nullability being lost? And, how can I ensure that this nullability information is correctly preserved when reading in the parquet file?

(Note that in my real use-case, I can't just hand-apply the schema again, I need it to be carried in the parquet file and correctly reconstituted on the read).

美羊羊 2025-02-20 03:47:43

This is a documented behaviour. From https://spark.apache.org/docs/3.3.0/sql-data-sources-parquet.html

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
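
If the original nullability is needed after the read, one possible workaround (only a sketch, not something the documentation prescribes) is to recover the schema that Spark stored in the footer key-value metadata under org.apache.spark.sql.parquet.row.metadata, as visible in the question's FileMetaData debug output, and re-apply it. The class and method names below are invented for illustration, and this assumes parquet-hadoop is on the classpath and that you point it at a single part file of the written output:

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

public class RestoreNullability {

    // Hypothetical helper: recover the schema that Spark wrote into the Parquet
    // footer and re-apply it to the Dataset read back from disk.
    public static Dataset<Row> readWithWriterSchema(SparkSession spark,
                                                    String parquetDir,
                                                    String onePartFile) throws Exception {
        Configuration conf = spark.sparkContext().hadoopConfiguration();

        try (ParquetFileReader footerReader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(new Path(onePartFile), conf))) {

            // The footer's key-value metadata holds the original Catalyst schema as JSON,
            // exactly as seen in the FileMetaData debug output in the question.
            Map<String, String> kv =
                footerReader.getFooter().getFileMetaData().getKeyValueMetaData();
            String schemaJson = kv.get("org.apache.spark.sql.parquet.row.metadata");
            StructType writerSchema = (StructType) DataType.fromJson(schemaJson);

            // Spark's own read still reports everything as nullable ...
            Dataset<Row> read = spark.read().format("parquet").load(parquetDir);

            // ... so rebuild the Dataset with the recovered schema, restoring nullable = false.
            return spark.createDataFrame(read.javaRDD(), writerSchema);
        }
    }
}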
