Non-nullability lost when reading a Parquet file in Spark 3.3.0
I read data from a CSV file, and supply a hand-crafted schema:
new StructType(new StructField[] {
    new StructField("id", LongType, false, Metadata.empty()),
    new StructField("foo", IntegerType, false, Metadata.empty()),
    new StructField("bar", DateType, true, Metadata.empty()) });
Printing the schema shows:
root
|-- id: long (nullable = false)
|-- foo: integer (nullable = false)
|-- bar: date (nullable = true)
And writing it to a parquet file using this code ...
df.write().format("parquet").save("data.parquet");
... generates this log message:
INFO : o.a.s.s.e.d.p.ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
"type" : "struct",
"fields" : [ {
"name" : "id",
"type" : "long",
"nullable" : false,
"metadata" : { }
}, {
"name" : "foo",
"type" : "integer",
"nullable" : false,
"metadata" : { }
}, {
"name" : "bar",
"type" : "date",
"nullable" : true,
"metadata" : { }
} ]
}
and corresponding Parquet message type:
message spark_schema {
required int64 id;
required int32 foo;
optional int32 bar (DATE);
}
All looks good there.
However, if I then read in that parquet file using this code:
Dataset<Row> read = spark.read().format("parquet").load("data.parquet");
... and print the schema, I get:
root
|-- id: long (nullable = true)
|-- foo: integer (nullable = true)
|-- bar: date (nullable = true)
As can be seen above, all columns have become nullable: the non-nullability specified in the original schema has been lost.
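(This can also be confirmed programmatically; a quick check against the Dataset from above:)

// Both fields were written as non-nullable, but come back nullable:
System.out.println(read.schema().apply("id").nullable());  // prints: true
System.out.println(read.schema().apply("foo").nullable()); // prints: true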
Now, if we take a look at some of the debug output during the load, it shows that the nullability information is present and correct in the file's metadata. (I've added newlines to make this more readable):
FileMetaData(
version:1,
schema:[SchemaElement(name:spark_schema, num_children:4),
SchemaElement(type:INT64, repetition_type:REQUIRED, name:id),
SchemaElement(type:INT32, repetition_type:REQUIRED, name:foo),
SchemaElement(type:INT32, repetition_type:OPTIONAL, name:bar, converted_type:DATE, logicalType:<LogicalType DATE:DateType()>)],
num_rows:7,
row_groups:null,
key_value_metadata:
[
KeyValue(key:org.apache.spark.version, value:3.3.0),
KeyValue(
key:org.apache.spark.sql.parquet.row.metadata,
value:{
"type":"struct",
"fields":
[
{"name":"id","type":"long","nullable":false,"metadata":{}},
{"name":"foo","type":"integer","nullable":false,"metadata":{}},
{"name":"bar","type":"date","nullable":true,"metadata":{}}
]
})
],
created_by:parquet-mr version 1.12.2 (build 77e30c8093386ec52c3cfa6c34b7ef3321322c94))
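(For anyone wanting to inspect this footer metadata themselves, it can be dumped directly with parquet-mr; a minimal sketch, assuming parquet-hadoop 1.12.x on the classpath and a placeholder part-file path:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path("data.parquet/part-00000.parquet"), // placeholder
                                 new Configuration()))) {
    // Prints org.apache.spark.version and org.apache.spark.sql.parquet.row.metadata
    reader.getFooter().getFileMetaData().getKeyValueMetaData()
          .forEach((k, v) -> System.out.println(k + " = " + v));
}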
The question, then, is: why (and where) is the non-nullability being lost? And how can I ensure that this nullability information is correctly preserved when reading the Parquet file back in?
(Note that in my real use case I can't just hand-apply the schema again; I need it to be carried in the Parquet file and correctly reconstituted on read.)
Comments (1)
This is a documented behaviour. From https://spark.apache.org/docs/3.3.0/sql-data-sources-parquet.html
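That page notes that, for compatibility reasons, all columns are automatically converted to be nullable when reading Parquet files.

Since the file footer does carry the original Catalyst schema as JSON (as shown in the debug output in the question), one workaround is to read that JSON back out of the footer and apply it explicitly on the read. A minimal sketch of that idea (not from the original answer), assuming parquet-hadoop on the classpath and a placeholder part-file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.StructType;

// Recover the schema JSON that Spark stored under org.apache.spark.sql.parquet.row.metadata ...
String json;
try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path("data.parquet/part-00000.parquet"), // placeholder
                                 new Configuration()))) {
    json = reader.getFooter().getFileMetaData().getKeyValueMetaData()
                 .get("org.apache.spark.sql.parquet.row.metadata");
}

// ... rebuild the StructType (nullable flags included) and force it onto the read.
StructType schema = (StructType) DataType.fromJson(json);
Dataset<Row> read = spark.read().schema(schema).parquet("data.parquet");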