Azure Databricks - writing to parquet files using spark.sql with unions and subqueries

Posted on 01-24 17:26


Issue:

I'm trying to write to a parquet file using spark.sql, but I run into issues whenever the query contains unions or subqueries. I know there's some syntax detail I can't seem to figure out.

Ex.

%python
    
df = spark.sql("SELECT
          sha2(Code, 256) as COUNTRY_SK,
          Code as COUNTRY_CODE,
          Name as COUNTRY_NAME,
          current_date() as EXTRACT_DATE
       
        FROM raw.EXTR_COUNTRY)
       
    UNION ALL
    
    SELECT
          -1 as COUNTRY_SK,
          'Unknown' as COUNTRY_CODE,
          'Unknown' as COUNTRY_NAME,
          current_date() as EXTRACT_DATE")
          
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country", 
   mode="overwrite")

When doing a simple query I have no issues at all, for example:

%python
    
df = spark.sql("select * from raw.EXTR_COUNTRY")
df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country/", 
   mode="overwrite")


不必在意 2025-01-31 17:26:22


There are a few problems with your code that need to be fixed:

  • you're using single quotes (") for a multi-line string. Instead you need to use triple quotes (""" or ''')
  • your SQL syntax is incorrect for the second part of the query (after UNION ALL) - you didn't specify FROM which table to pull that data. See the docs for details of the SQL syntax.

I really recommend debugging each subquery separately, maybe first using %sql, and only after it works, putting it into the spark.sql string.
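The quoting problem can be reproduced outside Databricks. Here is a minimal pure-Python sketch (no Spark needed; the embedded SQL is just a stand-in) showing that a single-quoted string literal cannot span lines, while a triple-quoted one can:

```python
# The original spark.sql call fails before Spark ever sees the query:
# a "..." string literal in Python cannot contain a raw newline.
broken = 'df = spark.sql("SELECT 1\nFROM t")'      # newline inside "..."
fixed = 'df = spark.sql("""SELECT 1\nFROM t""")'   # triple quotes allow newlines

def compiles(src):
    """Return True if src is syntactically valid Python."""
    try:
        compile(src, "<snippet>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles(broken))  # False: unterminated string literal
print(compiles(fixed))   # True: triple-quoted strings may span lines
```

This is why the simple one-line query worked while the multi-line one did not: the error is a Python syntax error, not a Spark SQL error.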

Also, because you're overwriting the data, it could be easier to use the create or replace table syntax to perform everything in SQL (docs), something like this:

create or replace table delta.`/mnt/devstorage/landing/companyx/country/` AS (
  SELECT
    sha2(Code, 256) as COUNTRY_SK,
    Code as COUNTRY_CODE,
    Name as COUNTRY_NAME,
    current_date() as EXTRACT_DATE
  FROM raw.EXTR_COUNTRY

  UNION ALL

  SELECT
    -1 as COUNTRY_SK,
    'Unknown' as COUNTRY_CODE,
    'Unknown' as COUNTRY_NAME,
    current_date() as EXTRACT_DATE
  FROM ....
)
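Following the advice above to debug each subquery separately, one way to sketch this (pure Python; the variable names are invented for illustration) is to keep each half of the UNION as its own string, so each part can be pasted into a %sql cell and tested on its own, then join them only at the end:

```python
# Hypothetical sketch: each part of the UNION lives in its own string,
# so it can be validated independently before composing the full query.
known_countries = """
SELECT sha2(Code, 256) AS COUNTRY_SK,
       Code            AS COUNTRY_CODE,
       Name            AS COUNTRY_NAME,
       current_date()  AS EXTRACT_DATE
FROM raw.EXTR_COUNTRY
"""

unknown_row = """
SELECT -1             AS COUNTRY_SK,
       'Unknown'      AS COUNTRY_CODE,
       'Unknown'      AS COUNTRY_NAME,
       current_date() AS EXTRACT_DATE
"""

# Compose the final statement from the already-tested pieces.
query = "\nUNION ALL\n".join(s.strip() for s in (known_countries, unknown_row))

# df = spark.sql(query)   # run inside Databricks; `spark` is not defined here
print(query)
```

This keeps each SELECT small enough to debug in isolation, which is usually where the stray parenthesis or missing FROM clause shows up.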
银河中√捞星星 2025-01-31 17:26:22


The quotes were the problem; the SQL script itself wasn't the issue. Using triple quotes (""" or ''') solved it.

 %python
        
    df = spark.sql("""SELECT
              sha2(Code, 256) as COUNTRY_SK,
              Code as COUNTRY_CODE,
              Name as COUNTRY_NAME,
              current_date() as EXTRACT_DATE
           
            FROM raw.EXTR_COUNTRY
           
        UNION ALL
        
        SELECT
              -1 as COUNTRY_SK,
              'Unknown' as COUNTRY_CODE,
              'Unknown' as COUNTRY_NAME,
              current_date() as EXTRACT_DATE""")
              
    df.write.parquet("dbfs:/mnt/devstorage/landing/companyx/country", 
       mode="overwrite")