PySpark: MutableLong cannot be cast to MutableInt (no long in the DataFrame)

Posted on 2025-01-17 01:57:02

I'm trying to read a profiles table from Athena in PySpark, using the Glue client from boto3, and check whether it is empty. Why does Spark fail on a cast between Int and Long when the table I'm reading has no Long column? I haven't found anything on Google or Stack Overflow that answers this.

Here is a summary of the code:

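from awsglue.context import GlueContext

# `session` is an existing SparkSession; `schema` is the expected StructType.
dataframe = GlueContext(session.sparkContext).create_dynamic_frame.from_catalog(
    database="xxx",
    table_name="profiles",
    catalog_id="xxx",
).toDF()
# Fall back to an empty DataFrame with the expected schema if the table has no rows.
if dataframe.rdd.isEmpty():
    dataframe = session.sparkContext.emptyRDD().toDF(schema)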

I'm getting the error:

ERROR GlueExceptionAnalysisListener: [Glue Exception Analysis] Event: GlueETLJobExceptionEvent
[...]
  File "/myScript.py", line 246, in load_table
   if dataframe.rdd.isEmpty()
[...]
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file s3://bucket/path/to/profiles/vault=c27/subgroup=1/part-00003-a97d95f5-713c-4756-808b-38c3866842cb.c000.snappy.parquet
[...]
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt

Here is the Athena DDL:

CREATE EXTERNAL TABLE `profiles`(
  `id` string, 
  `anonymousids` array<string>, 
  `lastconsentinsightusage` boolean, 
  `lastconsentactivationusage` boolean, 
  `gender` string, 
  `age` int, 
  `iata` string, 
  `continent` string, 
  `country` string, 
  `city` string, 
  `state` string, 
  `brandvisit` int, 
  `knownprofileinmarket` date, 
  `devicebrowser` string, 
  `devicebrand` string, 
  `deviceos` string, 
  `returnedinmarket` date)
PARTITIONED BY ( 
  `vault` varchar(5), 
  `subgroup` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket/path/to/profiles'
TBLPROPERTIES (
  'classification'='parquet', 
  'transient_lastDdlTime'='1645604528')

And here is the Parquet schema:

{
  "type" : "record",
  "name" : "spark_schema",
  "fields" : [ {
    "name" : "Id",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "anonymousIds",
    "type" : [ "null", {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "list",
        "fields" : [ {
          "name" : "element",
          "type" : [ "null", "string" ],
          "default" : null
        } ]
      }
    } ],
    "default" : null
  }, {
    "name" : "lastConsentInsightUsage",
    "type" : [ "null", "boolean" ],
    "default" : null
  }, {
    "name" : "lastConsentActivationUsage",
    "type" : [ "null", "boolean" ],
    "default" : null
  }, {
    "name" : "gender",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "age",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "iata",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "continent",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "country",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "city",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "state",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "brandVisit",
    "type" : [ "null", "int" ],
    "default" : null
  }, {
    "name" : "knownProfileInMarket",
    "type" : [ "null", {
      "type" : "int",
      "logicalType" : "date"
    } ],
    "default" : null
  }, {
    "name" : "deviceBrowser",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "deviceBrand",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "deviceOs",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "returnedInMarket",
    "type" : [ "null", {
      "type" : "int",
      "logicalType" : "date"
    } ],
    "default" : null
  } ]
}
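
To make the comparison between the DDL and the file schema explicit, here is a minimal sketch in plain Python that checks declared column types against an Avro-style schema dump like the one above. The type mapping, the function names (`primitive_of`, `find_mismatches`) and the `schema_bad` example are assumptions made for illustration; they are not part of Glue, Athena or Spark, and only three of the columns are included to keep it short.

```python
import json

# Hive/Athena column type -> primitive name expected in the Avro-style dump.
# This mapping is an assumption covering only the types used in this question.
HIVE_TO_AVRO = {"string": "string", "boolean": "boolean", "int": "int",
                "bigint": "long", "date": "int"}


def primitive_of(avro_type):
    """Return the primitive name of a type, unwrapping ["null", T] unions."""
    if isinstance(avro_type, list):
        non_null = [t for t in avro_type if t != "null"]
        return primitive_of(non_null[0])
    if isinstance(avro_type, dict):
        return avro_type["type"]
    return avro_type


def find_mismatches(ddl_columns, schema_json):
    """Compare declared types with file field types (names matched case-insensitively)."""
    fields = {f["name"].lower(): primitive_of(f["type"])
              for f in json.loads(schema_json)["fields"]}
    mismatches = []
    for name, hive_type in ddl_columns.items():
        expected = HIVE_TO_AVRO.get(hive_type)
        actual = fields.get(name.lower())
        if expected is not None and actual is not None and expected != actual:
            mismatches.append((name, hive_type, actual))
    return mismatches


# Subset of the DDL and of the schema shown in the question.
ddl = {"age": "int", "brandvisit": "int", "knownprofileinmarket": "date"}
schema_ok = json.dumps({"fields": [
    {"name": "age", "type": ["null", "int"]},
    {"name": "brandVisit", "type": ["null", "int"]},
    {"name": "knownProfileInMarket",
     "type": ["null", {"type": "int", "logicalType": "date"}]},
]})
# Hypothetical file in which `age` was written as a 64-bit integer.
schema_bad = json.dumps({"fields": [{"name": "age", "type": ["null", "long"]}]})

print(find_mismatches(ddl, schema_ok))   # []
print(find_mismatches(ddl, schema_bad))  # [('age', 'int', 'long')]
```

For the columns shown in this question the check reports no mismatch, which matches the observation that this file's schema contains no long field. Note that the partition columns (vault, subgroup) are not part of the file schema, so a check like this does not cover them.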

And one row from the Parquet file:

{"Id": "34e9bbcd3dd577d6bc3f9b82d9dd99666dafa0203486d2a604f59b7702d50d7d", "anonymousIds": [{"element": "5510"}], "lastConsentInsightUsage": true, "lastConsentActivationUsage": true, "gender": "F", "age": 40, "iata": "", "continent": "EU", "country": "DE", "city": "Frankfurt (Oder)", "state": "BB", "brandVisit": 9, "knownProfileInMarket": 18765, "deviceBrowser": "Googlebot", "deviceBrand": "Spider", "deviceOs": "Other", "returnedInMarket": 18765}
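
As a side note on reading the row above: the two date fields are stored with the Parquet date logical type, which is a 32-bit count of days since 1970-01-01, so they are not 64-bit values either. A small sketch of the conversion, where the helper name `parquet_date_to_iso` is made up for illustration:

```python
from datetime import date, timedelta


def parquet_date_to_iso(days_since_epoch):
    """Convert a Parquet date value (days since 1970-01-01) to an ISO date string."""
    return (date(1970, 1, 1) + timedelta(days=days_since_epoch)).isoformat()


# Value of knownProfileInMarket and returnedInMarket in the sample row.
print(parquet_date_to_iso(18765))  # 2021-05-18
```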
