Databricks Runtime 10.4 LTS - AnalysisException: No such struct field id in 0, 1 after upgrade

Posted on 2025-01-20 22:30:48

We are migrating from Databricks Runtime 9.1 LTS to 10.4 LTS, but we are running into strange behavior. Our existing code works on every runtime up to and including 10.3, and it stopped working in 10.4.

Problem:
We have a nested JSON file that we flatten into a Spark DataFrame using the code below:

from pyspark.sql import functions as F

# Explode the organizations first, then the ad accounts nested inside each organization.
adaccountsdf = df.withColumn('Exp_Organizations',
                             F.explode(F.col('organizations.organization')))\
                 .withColumn('Exp_AdAccounts',
                             F.explode(F.col('Exp_Organizations.ad_accounts')))\
                 .select(F.col('Exp_Organizations.id').alias('organizationId'),
                         F.col('Exp_Organizations.name').alias('organizationName'),
                         F.col('Exp_AdAccounts.id').alias('adAccountId'),
                         F.col('Exp_AdAccounts.name').alias('adAccountName'),
                         F.col('Exp_AdAccounts.timezone').alias('timezone'))
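
For readers without access to the source file, the intended effect of the two explode calls can be sketched in plain Python. The sample document is hypothetical (the real data is confidential), and its shape, a struct `organizations` holding an array `organization`, is an assumption. Only the field names come from the code above, and `flatten_ad_accounts` is an illustrative name, not part of the original job.

```python
# Hypothetical sample; only the field names are taken from the Spark code above.
sample = {
    "organizations": {
        "organization": [
            {
                "id": "org-1",
                "name": "Org One",
                "ad_accounts": [
                    {"id": "acc-1", "name": "Account A", "timezone": "UTC"},
                    {"id": "acc-2", "name": "Account B", "timezone": "Europe/Amsterdam"},
                ],
            },
            {
                "id": "org-2",
                "name": "Org Two",
                "ad_accounts": [
                    {"id": "acc-3", "name": "Account C", "timezone": "UTC"},
                ],
            },
        ]
    }
}


def flatten_ad_accounts(doc):
    """Return one flat row per (organization, ad account) pair."""
    rows = []
    # First explode: one iteration per organization.
    for org in doc["organizations"]["organization"]:
        # Second explode: one iteration per ad account of that organization.
        # A missing or empty list yields no rows, as explode does for null or empty arrays.
        for acc in org.get("ad_accounts") or []:
            rows.append({
                "organizationId": org["id"],
                "organizationName": org["name"],
                "adAccountId": acc["id"],
                "adAccountName": acc["name"],
                "timezone": acc["timezone"],
            })
    return rows


rows = flatten_ad_accounts(sample)
```

Each resulting row has the five flat string columns shown in the schema further down, so selecting any subset of them would be expected to work.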

Querying the DataFrame works for the following select (results hidden for confidentiality):

display(adaccountsdf.select("*"))

[Image: result of the statement above]

Displaying the schema of the DataFrame gives the following:

root
|-- organizationId: string (nullable = true)
|-- organizationName: string (nullable = true)
|-- adAccountId: string (nullable = true)
|-- adAccountName: string (nullable = true)
|-- timezone: string (nullable = true)

So everything looks as it should. The problem appears as soon as we select from the last three fields (adAccountId, adAccountName and timezone), for example:

display(adaccountsdf.select("adAccountId","adAccountName"))

We get the error AnalysisException: No such struct field id in 0, 1.

[Image: result of the statement above]

However, running display(adaccountsdf.select("adAccountId")) on its own works just fine.

Does anyone know why this is happening? It is a very strange error that only shows up in Databricks Runtime 10.4. All previous runtimes, including 10.3, 10.2, 10.1 and 9.1 LTS, work fine. The issue seems to be caused by calling explode on a column that was itself produced by an earlier explode.

UPDATE:

For some reason, when I run adaccountsdf.cache() before the select statements, the issue disappears. I would still like to know what causes this issue in runtime 10.4 but not in the earlier ones.
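
The workaround from the update could be wrapped in a small helper like the one below. This is only a sketch of what is reported above, not an explanation of the root cause. The name `select_after_cache` is hypothetical, and the helper relies on nothing beyond the `cache()` and `select()` methods of a PySpark DataFrame.

```python
def select_after_cache(df, *columns):
    """Mark df for caching, then select the given columns from it.

    Hypothetical helper mirroring the workaround in the update above.
    """
    # DataFrame.cache() returns the same DataFrame, now marked for caching.
    cached = df.cache()
    return cached.select(*columns)
```

In a notebook this would be called as display(select_after_cache(adaccountsdf, "adAccountId", "adAccountName")). Whether it avoids the error on every 10.4 cluster is not something the post establishes.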
