Nullability lost when reading Parquet in Spark 3.3.0

I read data from a CSV file, and supply a hand-crafted schema:

new StructType(new StructField[] {
    new StructField("id", LongType, false, Metadata.empty()),
    new StructField("foo", IntegerType, false, Metadata.empty()),
    new StructField("bar", DateType, true, Metadata.empty()) });

Printing the schema shows:

root
 |-- id: long (nullable = false)
 |-- foo: integer (nullable = false)
 |-- bar: date (nullable = true)

And writing it to a parquet file using this code ...

df.write().format("parquet").save("data.parquet");

... generates this log message:

INFO : o.a.s.s.e.d.p.ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "long",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "foo",
    "type" : "integer",
    "nullable" : false,
    "metadata" : { }
  }, {
    "name" : "bar",
    "type" : "date",
    "nullable" : true,
    "metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  required int64 id;
  required int32 foo;
  optional int32 bar (DATE);
}

All looks good there.

However, if I then read in that parquet file using this code:

Dataset<Row> read = spark.read().format("parquet").load("data.parquet");

... and print the schema, I get:

root
 |-- id: long (nullable = true)
 |-- foo: integer (nullable = true)
 |-- bar: date (nullable = true)

As can be seen above, all columns have become nullable - the non-nullability specified in the original schema has been lost.
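
A quick programmatic check confirms the same thing (just a sketch):

// Every field of the re-read Dataset now reports nullable = true.
for (StructField f : read.schema().fields()) {
    System.out.println(f.name() + " nullable = " + f.nullable());
}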

Now, if we take a look at some of the debug that is output during the load, it shows that Spark is correctly identifying the nullability. (I've added newlines to make this more readable):

FileMetaData(
  version:1, 
  schema:[SchemaElement(name:spark_schema, num_children:4), 
  SchemaElement(type:INT64, repetition_type:REQUIRED, name:id), 
  SchemaElement(type:INT32, repetition_type:REQUIRED, name:foo), 
  SchemaElement(type:INT32, repetition_type:OPTIONAL, name:bar, converted_type:DATE, logicalType:<LogicalType DATE:DateType()>)], 
  num_rows:7, 
  row_groups:null, 
  key_value_metadata:
      [
        KeyValue(key:org.apache.spark.version, value:3.3.0), 
        KeyValue(
          key:org.apache.spark.sql.parquet.row.metadata, 
          value:{
            "type":"struct",
            "fields":
                [
                  {"name":"id","type":"long","nullable":false,"metadata":{}},
                  {"name":"foo","type":"integer","nullable":false,"metadata":{}},
                  {"name":"bar","type":"date","nullable":true,"metadata":{}}
                ]
          })
        ], 
  created_by:parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94))

The question is then: why (and where) is the non-nullability being lost? And, how can I ensure that this nullability information is correctly preserved when reading in the parquet file?

(Note that in my real use-case, I can't just hand-apply the schema again, I need it to be carried in the parquet file and correctly reconstituted on the read).

美羊羊 2025-02-20 03:47:43

This is a documented behaviour. From https://spark.apache.org/docs/3.3.0/sql-data-sources-parquet.html

Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema of the original data. When reading Parquet files, all columns are automatically converted to be nullable for compatibility reasons.
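
If the original nullability is needed after the read, one possible workaround (only a sketch, not something the documentation prescribes) is to recover the schema that Spark stored in the footer key-value metadata under org.apache.spark.sql.parquet.row.metadata, as visible in the question's FileMetaData debug output, and re-apply it. The class and method names below are invented for illustration, and this assumes parquet-hadoop is on the classpath and that you point it at a single part file of the written output:

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

public class RestoreNullability {

    // Hypothetical helper: recover the schema that Spark wrote into the Parquet
    // footer and re-apply it to the Dataset read back from disk.
    public static Dataset<Row> readWithWriterSchema(SparkSession spark,
                                                    String parquetDir,
                                                    String onePartFile) throws Exception {
        Configuration conf = spark.sparkContext().hadoopConfiguration();

        try (ParquetFileReader footerReader =
                 ParquetFileReader.open(HadoopInputFile.fromPath(new Path(onePartFile), conf))) {

            // The footer's key-value metadata holds the original Catalyst schema as JSON,
            // exactly as seen in the FileMetaData debug output in the question.
            Map<String, String> kv =
                footerReader.getFooter().getFileMetaData().getKeyValueMetaData();
            String schemaJson = kv.get("org.apache.spark.sql.parquet.row.metadata");
            StructType writerSchema = (StructType) DataType.fromJson(schemaJson);

            // Spark's own read still reports everything as nullable ...
            Dataset<Row> read = spark.read().format("parquet").load(parquetDir);

            // ... so rebuild the Dataset with the recovered schema, restoring nullable = false.
            return spark.createDataFrame(read.javaRDD(), writerSchema);
        }
    }
}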
