Databricks Runtime 10.4 LTS - AnalysisException: No such struct field id in 0, 1 after upgrading
We are migrating from Databricks Runtime 9.1 LTS to 10.4 LTS, but we're running into strange behavior. Our existing code works up to runtime 10.3 and stops working in 10.4.
Problem:
We have a nested JSON file that we flatten into a Spark DataFrame using the code below:
from pyspark.sql import functions as F
# Explode the organizations, then each organization's ad accounts, then flatten.
adaccountsdf = df.withColumn('Exp_Organizations',
                             F.explode(F.col('organizations.organization')))\
                 .withColumn('Exp_AdAccounts',
                             F.explode(F.col('Exp_Organizations.ad_accounts')))\
                 .select(F.col('Exp_Organizations.id').alias('organizationId'),
                         F.col('Exp_Organizations.name').alias('organizationName'),
                         F.col('Exp_AdAccounts.id').alias('adAccountId'),
                         F.col('Exp_AdAccounts.name').alias('adAccountName'),
                         F.col('Exp_AdAccounts.timezone').alias('timezone'))
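To make the shape of the data concrete, here is a small pure-Python sketch of what the two explode calls followed by the select should produce. It does not use Spark at all, the sample document is invented for illustration (the real data is confidential), and `flatten_ad_accounts` is a hypothetical helper name; only the field names come from the code above.

```python
# Hypothetical sample input; only the field names are taken from the question.
sample = {
    "organizations": {
        "organization": [
            {
                "id": "org-1",
                "name": "Org One",
                "ad_accounts": [
                    {"id": "acc-1", "name": "Account A", "timezone": "UTC"},
                    {"id": "acc-2", "name": "Account B", "timezone": "Europe/Amsterdam"},
                ],
            },
            {
                "id": "org-2",
                "name": "Org Two",
                "ad_accounts": [
                    {"id": "acc-3", "name": "Account C", "timezone": "UTC"},
                ],
            },
        ]
    }
}


def flatten_ad_accounts(doc):
    """Mimic explode(organizations.organization) followed by explode(ad_accounts)."""
    rows = []
    # F.explode drops rows whose array is null or empty; `or []` mirrors that.
    for org in doc["organizations"]["organization"] or []:
        for acc in org.get("ad_accounts") or []:
            rows.append({
                "organizationId": org["id"],
                "organizationName": org["name"],
                "adAccountId": acc["id"],
                "adAccountName": acc["name"],
                "timezone": acc["timezone"],
            })
    return rows


rows = flatten_ad_accounts(sample)
for row in rows:
    print(row)
```

Each output row pairs one organization with one of its ad accounts, which is the five-column layout the schema further down shows.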
Now, when we query the DataFrame, the following select works (results hidden for confidentiality):
display(adaccountsdf.select("*"))
When I display the schema of the DataFrame, we get the following:
root
|-- organizationId: string (nullable = true)
|-- organizationName: string (nullable = true)
|-- adAccountId: string (nullable = true)
|-- adAccountName: string (nullable = true)
|-- timezone: string (nullable = true)
So everything looks as it should. But the moment we start selecting from the last three fields (adAccountId, adAccountName and timezone), for example:
display(adaccountsdf.select("adAccountId","adAccountName"))
We get the error AnalysisException: No such struct field id in 0, 1.
However, when I run the statement display(adaccountsdf.select("adAccountId")), it works just fine.
Does anyone know why this is happening? It's a very strange error that only shows up in Databricks Runtime 10.4. All previous runtimes, including 10.3, 10.2, 10.1 and 9.1 LTS, work fine. The issue seems to be caused by using the explode function on an already exploded column in the DataFrame.
UPDATE:
For some reason, when I run adaccountsdf.cache() before my select statements, the issue disappears. I would still like to know what causes this issue in runtime 10.4 but not in the other runtimes.
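The workaround from the update can be written as a small helper. This is only a sketch: `select_after_cache` is a hypothetical name, `df` is assumed to be a `pyspark.sql.DataFrame`, and nothing in this thread explains why caching avoids the error.

```python
def select_after_cache(df, *column_names):
    """Cache `df`, then project the requested columns from the cached frame.

    Assumes `df` is a pyspark.sql.DataFrame. cache() marks the DataFrame for
    caching and returns the same DataFrame, so the calls can be chained. The
    cache is lazy: data is only materialized when the first action runs.
    """
    cached = df.cache()
    return cached.select(*column_names)


# Usage in a notebook, where adaccountsdf is the DataFrame built in the question:
# display(select_after_cache(adaccountsdf, "adAccountId", "adAccountName"))
```

Caching keeps the data in executor memory until `unpersist()` is called, so it is worth releasing the DataFrame once the selects are done.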
Comments (1)
In my case, I had a similar issue with Databricks 10.4 LTS. I needed to add several caches just to break up the execution plan. After I opened a support ticket with Microsoft, a bugfix was uploaded to our image and the problem was resolved (the Catalyst optimizer was infinitely optimizing the execution plan for complex types or for operations on the same column).
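Besides `cache()`, Spark has another way to cut the plan at a given point: `DataFrame.localCheckpoint()`. The sketch below is an assumption on my side, not something this thread confirms as a fix for the 10.4 bug, and `truncate_plan` is a hypothetical helper name.

```python
def truncate_plan(df, eager=True):
    """Return a DataFrame whose logical plan is truncated at this point.

    Assumes `df` is a pyspark.sql.DataFrame. localCheckpoint() stores the
    data on the executors and returns a new DataFrame that no longer carries
    the upstream plan, so the optimizer has less to rewrite. With eager=True
    the checkpoint is computed immediately rather than on the first action.
    """
    return df.localCheckpoint(eager=eager)


# Usage in a notebook, where adaccountsdf is the DataFrame built in the question:
# flat = truncate_plan(adaccountsdf)
# display(flat.select("adAccountId", "adAccountName"))
```

Local checkpoints live on executor storage and are lost if an executor dies, so this trades fault tolerance for a shorter plan.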