PySpark: MutableLong cannot be cast to MutableInt (no long in the dataframe)

Posted 2025-01-17 01:57:02

I'm trying to read a profiles table from Athena in PySpark, using the Glue client from boto3, and check whether it is empty. Why does Spark fail on an Int/Long conversion when the table I'm reading has no Long column? I found nothing on Google or Stack Overflow that answers this.

Here is a summary of the code:

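from awsglue.context import GlueContext

# Read the catalog table as a DynamicFrame, then convert it to a Spark DataFrame.
dataframe = GlueContext(session.sparkContext).create_dynamic_frame.from_catalog(
    database="xxx",
    table_name="profiles",
    catalog_id="xxx",
).toDF()
# Fall back to an empty DataFrame with the expected schema (`schema` is defined elsewhere).
if dataframe.rdd.isEmpty():
    dataframe = session.sparkContext.emptyRDD().toDF(schema)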

I'm getting the error:

ERROR GlueExceptionAnalysisListener: [Glue Exception Analysis] Event: GlueETLJobExceptionEvent
[...]
  File "/myScript.py", line 246, in load_table
   if dataframe.rdd.isEmpty()
[...]
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file s3://bucket/path/to/profiles/vault=c27/subgroup=1/part-00003-a97d95f5-713c-4756-808b-38c3866842cb.c000.snappy.parquet
[...]
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
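
To narrow down which file and partition to inspect, the path in the ParquetDecodingException line can be pulled apart. This is only a small sketch: the helper names `failing_file` and `partition_values` are made up for illustration, and it assumes Hive-style `key=value` path segments like the ones in the trace.

```python
import re

# Error line copied from the stack trace above.
ERROR_LINE = (
    "Caused by: org.apache.parquet.io.ParquetDecodingException: "
    "Can not read value at 1 in block 0 in file "
    "s3://bucket/path/to/profiles/vault=c27/subgroup=1/"
    "part-00003-a97d95f5-713c-4756-808b-38c3866842cb.c000.snappy.parquet"
)


def failing_file(error_line):
    """Return the S3 path that follows 'in file', or None if there is none."""
    match = re.search(r"in file (s3://\S+)", error_line)
    return match.group(1) if match else None


def partition_values(path):
    """Collect Hive-style key=value path segments into a dict of strings."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


path = failing_file(ERROR_LINE)
print(path)
print(partition_values(path))  # {'vault': 'c27', 'subgroup': '1'}
```

The schema of that specific file (rather than an arbitrary file of the table) is the one worth comparing against the catalog.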

Here is the Athena DDL:

CREATE EXTERNAL TABLE `profiles`(
  `id` string, 
  `anonymousids` array<string>, 
  `lastconsentinsightusage` boolean, 
  `lastconsentactivationusage` boolean, 
  `gender` string, 
  `age` int, 
  `iata` string, 
  `continent` string, 
  `country` string, 
  `city` string, 
  `state` string, 
  `brandvisit` int, 
  `knownprofileinmarket` date, 
  `devicebrowser` string, 
  `devicebrand` string, 
  `deviceos` string, 
  `returnedinmarket` date)
PARTITIONED BY ( 
  `vault` varchar(5), 
  `subgroup` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket/path/to/profiles'
TBLPROPERTIES (
  'classification'='parquet', 
  'transient_lastDdlTime'='1645604528')

And here is the Parquet schema:

{
  "type" : "record",
  "name" : "spark_schema",
  "fields" : [ {
    "name" : "Id",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "anonymousIds",
    "type" : [ "null", {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "list",
        "fields" : [ {
          "name" : "element",
          "type" : [ "null", "string" ],
          "default" : null
        } ]
      }
    } ],
    "default" : null
  }, {
    "name" : "lastConsentInsightUsage",
    "type" : [ "null", "boolean" ],
    "default" : null
  }, {
    "name" : "lastConsentActivationUsage",
    "type" : [ "null", "boolean" ],
    "default" : null
  }, {
    "name" : "gender",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "age",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "iata",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "continent",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "country",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "city",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "state",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "brandVisit",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "knownProfileInMarket",
    "type" : [ "null", {
      "type" : "int",
      "logicalType" : "date"
    } ],
    "default" : null
  }, {
    "name" : "deviceBrowser",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "deviceBrand",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "deviceOs",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "returnedInMarket",
    "type" : [ "null", {
      "type" : "int",
      "logicalType" : "date"
    } ],
    "default" : null
  } ]
}
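
As a sanity check, the DDL and the schema dump can be compared column by column. The sketch below is plain Python with the types transcribed by hand from the two listings above; `type_mismatches` is a hypothetical helper, the nested `anonymousIds` record is simplified to `array<string>`, and `date` stands for an int with logicalType `date`.

```python
# Column types copied from the Athena DDL above. Partition columns are left
# out because they come from the S3 path, not from inside the Parquet files.
CATALOG_TYPES = {
    "id": "string", "anonymousids": "array<string>",
    "lastconsentinsightusage": "boolean",
    "lastconsentactivationusage": "boolean",
    "gender": "string", "age": "int", "iata": "string",
    "continent": "string", "country": "string", "city": "string",
    "state": "string", "brandvisit": "int", "knownprofileinmarket": "date",
    "devicebrowser": "string", "devicebrand": "string",
    "deviceos": "string", "returnedinmarket": "date",
}

# Types copied from the schema dump above (simplified as described).
FILE_TYPES = {
    "Id": "string", "anonymousIds": "array<string>",
    "lastConsentInsightUsage": "boolean",
    "lastConsentActivationUsage": "boolean",
    "gender": "string", "age": "int", "iata": "string",
    "continent": "string", "country": "string", "city": "string",
    "state": "string", "brandVisit": "int", "knownProfileInMarket": "date",
    "deviceBrowser": "string", "deviceBrand": "string",
    "deviceOs": "string", "returnedInMarket": "date",
}


def type_mismatches(catalog, file_schema):
    """Return {column: (catalog_type, file_type)} for columns that disagree.

    Names are compared case-insensitively because the catalog lowercases
    them. A column missing on one side shows up with None for that side.
    """
    catalog_lower = {name.lower(): typ for name, typ in catalog.items()}
    file_lower = {name.lower(): typ for name, typ in file_schema.items()}
    result = {}
    for name in sorted(set(catalog_lower) | set(file_lower)):
        pair = (catalog_lower.get(name), file_lower.get(name))
        if pair[0] != pair[1]:
            result[name] = pair
    return result


print(type_mismatches(CATALOG_TYPES, FILE_TYPES))  # {}

# Hypothetical file whose `age` column was written as a long, for contrast.
other_file = {**FILE_TYPES, "age": "long"}
print(type_mismatches(CATALOG_TYPES, other_file))  # {'age': ('int', 'long')}
```

With the listings as posted, nothing differs. One possible reading is that the file named in the error has a different schema from the file dumped here, but that is an assumption the post does not confirm.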

And one row of the Parquet file:

{"Id": "34e9bbcd3dd577d6bc3f9b82d9dd99666dafa0203486d2a604f59b7702d50d7d", "anonymousIds": [{"element": "5510"}], "lastConsentInsightUsage": true, "lastConsentActivationUsage": true, "gender": "F", "age": 40, "iata": "", "continent": "EU", "country": "DE", "city": "Frankfurt (Oder)", "state": "BB", "brandVisit": 9, "knownProfileInMarket": 18765, "deviceBrowser": "Googlebot", "deviceBrand": "Spider", "deviceOs": "Other", "returnedInMarket": 18765}
