Azure Databricks - writing parquet files with spark.sql using unions and subqueries
Issue:
I'm trying to write to a parquet file using spark.sql, however I encounter issues when having unions or subqueries. I know there's some syntax I can't seem to figure out.
Ex.
%python
df = spark.sql("SELECT
sha2(Code, 256) as COUNTRY_SK,
Code as COUNTRY_CODE,
Name as COUNTRY_NAME,
current_date() as EXTRACT_DATE
FROM raw.EXTR_COUNTRY)
UNION ALL
SELECT
-1 as COUNTRY_SK,
'Unknown' as COUNTRY_CODE,
'Unknown' as COUNTRY_NAME,
current_date() as EXTRACT_DATE")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country",
mode="overwrite")
When doing a simple query I have no issues at all, such as:
%python
df = spark.sql("select * from raw.EXTR_COUNTRY")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country/",
mode="overwrite")
There are a few problems with your code that need to be fixed:

- You can't use a single double quote (") for a multi-line string. Instead you need to use triple quotes (""" or ''').
- There is an extra closing parenthesis (`)`) before the UNION ALL.
- In the second SELECT you didn't specify FROM which table you need to pull that data. See the docs for details of the SQL syntax.

I really recommend debugging each subquery separately, maybe first using %sql, and only after it works, putting it into the spark.sql string. Also, because you're overwriting the data, it could be easier to use CREATE OR REPLACE TABLE syntax to perform everything in SQL (docs), something like this:
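The example that followed was not preserved in this copy of the answer. A minimal sketch of the CREATE OR REPLACE TABLE approach could look like the following; the target table name `landing.COUNTRY` is an assumption for illustration, not taken from the original answer:

```python
# Hypothetical sketch: the table name landing.COUNTRY is an assumption.
# Triple quotes allow the statement to span multiple lines.
ctas_query = """
CREATE OR REPLACE TABLE landing.COUNTRY AS
SELECT
    sha2(Code, 256) AS COUNTRY_SK,
    Code AS COUNTRY_CODE,
    Name AS COUNTRY_NAME,
    current_date() AS EXTRACT_DATE
FROM raw.EXTR_COUNTRY
UNION ALL
SELECT
    -1 AS COUNTRY_SK,
    'Unknown' AS COUNTRY_CODE,
    'Unknown' AS COUNTRY_NAME,
    current_date() AS EXTRACT_DATE
"""
# In a Databricks notebook with an active Spark session:
# spark.sql(ctas_query)
```

With this approach the overwrite semantics live in the SQL itself, so no separate df.write step is needed.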
The quotes solved the issue; the SQL script itself wasn't the problem. So using triple quotes (""" or ''') solved the issue.
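For reference, the triple-quoted fix the comment describes might look like this sketch (the stray `)` before UNION ALL is also removed; the `spark` session only exists inside a notebook, so the calls that need it are shown as comments):

```python
# Corrected query as a triple-quoted string: a single-double-quoted string
# cannot span multiple lines, and the extra ")" before UNION ALL is removed.
query = """
SELECT
    sha2(Code, 256) AS COUNTRY_SK,
    Code AS COUNTRY_CODE,
    Name AS COUNTRY_NAME,
    current_date() AS EXTRACT_DATE
FROM raw.EXTR_COUNTRY
UNION ALL
SELECT
    -1 AS COUNTRY_SK,
    'Unknown' AS COUNTRY_CODE,
    'Unknown' AS COUNTRY_NAME,
    current_date() AS EXTRACT_DATE
"""
# In a Databricks notebook with an active Spark session:
# df = spark.sql(query)
# df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country",
#                  mode="overwrite")
```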