PySpark: MutableLong cannot be cast to MutableInt (no long in dataframe)
I'm trying to read a profiles table from Athena in PySpark using the Glue client from boto3 and check whether it is empty. Why does Spark fail on a cast between Int and Long, given that the table I'm reading has no Long column? I found nothing on Google or StackOverflow that answers this problem.
Here is a summary of the code:
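from awsglue.context import GlueContext

# `session` (the SparkSession) and `schema` are defined elsewhere in the script.
dataframe = GlueContext(session.sparkContext).create_dynamic_frame.from_catalog(
    database="xxx",
    table_name="profiles",
    catalog_id="xxx"
).toDF()
if dataframe.rdd.isEmpty():
    dataframe = session.sparkContext.emptyRDD().toDF(schema)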
I'm getting the error:
ERROR GlueExceptionAnalysisListener: [Glue Exception Analysis] Event: GlueETLJobExceptionEvent
[...]
File "/myScript.py", line 246, in load_table
if dataframe.rdd.isEmpty()
[...]
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file s3://bucket/path/to/profiles/vault=c27/subgroup=1/part-00003-a97d95f5-713c-4756-808b-38c3866842cb.c000.snappy.parquet
[...]
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
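To narrow the problem down, it can help to pull the failing object and its partition values out of the ParquetDecodingException message, so that the one file can be inspected on its own. The following is only a minimal sketch: `parse_failing_file` and `FILE_RE` are hypothetical names, not part of Glue or Spark, and the regular expression assumes the message keeps the "in file s3://..." wording shown above.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern: assumes the exception text contains "in file <s3 uri>".
FILE_RE = re.compile(r"in file (s3[an]?://\S+)")


def parse_failing_file(message):
    """Return the S3 URI, bucket, key and Hive-style partition values
    found in a ParquetDecodingException message, or None if no file is named."""
    match = FILE_RE.search(message)
    if match is None:
        return None
    uri = match.group(1)
    parsed = urlparse(uri)
    partitions = {}
    # Partition directories look like "name=value"; the file name has no "=".
    for segment in parsed.path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            partitions[key] = value
    return {
        "uri": uri,
        "bucket": parsed.netloc,
        "key": parsed.path.lstrip("/"),
        "partitions": partitions,
    }


message = (
    "org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 "
    "in block 0 in file s3://bucket/path/to/profiles/vault=c27/subgroup=1/"
    "part-00003-a97d95f5-713c-4756-808b-38c3866842cb.c000.snappy.parquet"
)
info = parse_failing_file(message)
print(info["partitions"])
```

The partition values come back as strings, exactly as they appear in the key, so `subgroup` is "1" here rather than an integer.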
Here is the Athena DDL:
CREATE EXTERNAL TABLE `profiles`(
  `id` string,
  `anonymousids` array<string>,
  `lastconsentinsightusage` boolean,
  `lastconsentactivationusage` boolean,
  `gender` string,
  `age` int,
  `iata` string,
  `continent` string,
  `country` string,
  `city` string,
  `state` string,
  `brandvisit` int,
  `knownprofileinmarket` date,
  `devicebrowser` string,
  `devicebrand` string,
  `deviceos` string,
  `returnedinmarket` date)
PARTITIONED BY (
  `vault` varchar(5),
  `subgroup` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket/path/to/profiles'
TBLPROPERTIES (
  'classification'='parquet',
  'transient_lastDdlTime'='1645604528')
And here is the Parquet schema:
{
  "type" : "record",
  "name" : "spark_schema",
  "fields" : [ {
    "name" : "Id",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "anonymousIds",
    "type" : [ "null", {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "list",
        "fields" : [ {
          "name" : "element",
          "type" : [ "null", "string" ],
          "default" : null
        } ]
      }
    } ],
    "default" : null
  }, {
    "name" : "lastConsentInsightUsage",
    "type" : [ "null", "boolean" ],
    "default" : null
  }, {
    "name" : "lastConsentActivationUsage",
    "type" : [ "null", "boolean" ],
    "default" : null
  }, {
    "name" : "gender",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "age",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "iata",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "continent",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "country",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "city",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "state",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "brandVisit",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "knownProfileInMarket",
    "type" : [ "null", {
      "type" : "int",
      "logicalType" : "date"
    } ],
    "default" : null
  }, {
    "name" : "deviceBrowser",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "deviceBrand",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "deviceOs",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "returnedInMarket",
    "type" : [ "null", {
      "type" : "int",
      "logicalType" : "date"
    } ],
    "default" : null
  } ]
}
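The schema above comes from a single file, so one thing worth checking is whether every file under the table location agrees with the catalog. The sketch below compares the DDL types with the types found in a file, matching column names case-insensitively because the catalog stores `brandvisit` while the file stores `brandVisit`. The names `CATALOG_TYPES`, `ARROW_TO_HIVE`, `read_file_types` and `find_type_mismatches` are hypothetical, and the `other_file` values are invented for illustration: nothing in the question shows that such a file exists.

```python
# Hive/Athena types as declared in the DDL (only the integer-backed columns).
CATALOG_TYPES = {
    "age": "int",
    "brandvisit": "int",
    "knownprofileinmarket": "date",
    "returnedinmarket": "date",
}

# Hypothetical translation from pyarrow type strings to Hive type names.
ARROW_TO_HIVE = {
    "int32": "int",
    "int64": "bigint",
    "date32[day]": "date",
    "string": "string",
    "bool": "boolean",
}


def read_file_types(local_path):
    """Read column types from a local copy of one Parquet file (needs pyarrow)."""
    import pyarrow.parquet as pq  # imported lazily so the rest runs without pyarrow

    schema = pq.read_schema(local_path)
    return {
        name: ARROW_TO_HIVE.get(str(typ), str(typ))
        for name, typ in zip(schema.names, schema.types)
    }


def find_type_mismatches(catalog_types, file_types):
    """Return {column: (catalog_type, file_type)} for columns whose types differ."""
    file_lower = {name.lower(): typ for name, typ in file_types.items()}
    mismatches = {}
    for name, expected in catalog_types.items():
        actual = file_lower.get(name.lower())
        if actual is not None and actual != expected:
            mismatches[name] = (expected, actual)
    return mismatches


# Types taken from the schema posted above.
posted_file = {
    "age": "int",
    "brandVisit": "int",
    "knownProfileInMarket": "date",
    "returnedInMarket": "date",
}
# Invented example of a file that would disagree with the catalog.
other_file = dict(posted_file, brandVisit="bigint")

print(find_type_mismatches(CATALOG_TYPES, posted_file))
print(find_type_mismatches(CATALOG_TYPES, other_file))
```

For the posted schema the comparison finds nothing, which matches the observation that the table has no Long column. A file like the invented `other_file` would be reported as `brandvisit` with catalog type int and file type bigint.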
And one row from the Parquet file:
{"Id": "34e9bbcd3dd577d6bc3f9b82d9dd99666dafa0203486d2a604f59b7702d50d7d", "anonymousIds": [{"element": "5510"}], "lastConsentInsightUsage": true, "lastConsentActivationUsage": true, "gender": "F", "age": 40, "iata": "", "continent": "EU", "country": "DE", "city": "Frankfurt (Oder)", "state": "BB", "brandVisit": 9, "knownProfileInMarket": 18765, "deviceBrowser": "Googlebot", "deviceBrand": "Spider", "deviceOs": "Other", "returnedInMarket": 18765}
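The two date columns appear as plain integers in this row because Parquet's DATE logical type annotates a 32-bit int that counts days since the Unix epoch, so they are not Long values either. A small sketch of the conversion, with `days_to_date` as a hypothetical helper name:

```python
from datetime import date, timedelta


def days_to_date(days_since_epoch):
    """Convert a Parquet DATE value (days since 1970-01-01) to a calendar date."""
    return date(1970, 1, 1) + timedelta(days=days_since_epoch)


# Value of knownProfileInMarket and returnedInMarket in the row above.
known_profile_in_market = days_to_date(18765)
print(known_profile_in_market.isoformat())
```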