How to create external tables on Azure Delta Lake (similar to Hive)

How do I create external Delta tables on Azure Data Lake Storage? I am currently working on a migration project (from PySpark/Hadoop to Azure). I couldn't find much documentation on creating unmanaged tables in Azure Delta Lake. Here is the sequence of operations I can currently perform in a PySpark/Hive/HDFS setup; I would like to know how to achieve the same on Azure.

Actions in sequence:

  1. Create a dataframe DF.
  2. Drop the Hive external table if it exists, then load dataframe DF into this external table using DF.write.insertInto("table").
  3. Create a dataframe DF1.
  4. Drop the Hive external table if it exists, then load dataframe DF1 into this external table using DF1.write.insertInto("table").

Even though I perform "drop table if exists" before loading the second dataframe, if I query the table after step 4 I can see content from both dataframes, because dropping a Hive external table only removes the table structure and not the actual data. Here is how it looks:

>>> df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
>>> df1 = spark.createDataFrame([('abcd1','xyz1')], ['s', 'd'])
>>> spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.test_table (s string,d string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs://system/dev/stage/test_table'")
>>> df.write.insertInto("mydb.test_table",overwrite=True)
>>> spark.sql('DROP TABLE IF EXISTS mydb.test_table')
>>> df1.write.insertInto("mydb.test_table",overwrite=False)
>>> spark.sql("select * from mydb.test_table").show()
+-----+----+
|    s|   d|
+-----+----+
|abcd1|xyz1|
| abcd| xyz|
+-----+----+

I am trying to perform something similar using an Azure Delta Lake table, with the steps below:

  1. Create the dataframes.
  2. Save the dataframes in ADLS (probably this is what I am doing wrong here; should I use a mounted dbfs path instead of the container path? See the note right after this list.)
  3. Create an unmanaged table on top of this path.
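
For context on step 2: on Databricks, a direct abfss:// path can normally be read and written without any mount, as long as the session has credentials for the storage account; mounting to dbfs: is an alternative rather than a requirement. Below is a minimal sketch using account-key authentication, where the storage account name, secret scope, and secret key names are all placeholders.

# Sketch only: <storage-account>, "my-scope" and "adls-account-key" are placeholder names.
# Once the account key (or a service principal) is configured, abfss:// paths can be
# accessed directly; no dbfs mount is needed for this.
storage_account = "<storage-account>"
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="adls-account-key"),
)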

Here is the code in my Databricks notebook.

df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
table_path = f"abfss://[email protected]/stage/test_table"
df.write.format("delta").mode("overwrite").option("path",table_path)
spark.sql("CREATE TABLE test_table USING DELTA LOCATION 'abfss://[email protected]/stage/test_table'")

However, it is not writing the dataframe to the location table_path, and the final step fails to create the table (perhaps a dbfs: mount path is required here?). How can I perform similar operations using unmanaged Delta Lake tables?
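
For reference, a minimal sketch of what the intended flow might look like: note that .option("path", table_path) only configures the DataFrameWriter, so a terminal call such as .save() is what actually writes the data. The abfss path below is a placeholder, and the session is assumed to already have access to the storage account (for example via an account key, a service principal, or a mount).

# Minimal sketch: placeholder container/account in the abfss path; storage access
# is assumed to be configured already.
table_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/stage/test_table"

df = spark.createDataFrame([('abcd', 'xyz')], ['s', 'd'])

# .save() (or .saveAsTable()) executes the write; .option("path", ...) on its own does not.
df.write.format("delta").mode("overwrite").save(table_path)

# Register an unmanaged (external) table over the Delta files now sitting at table_path.
spark.sql(f"CREATE TABLE IF NOT EXISTS test_table USING DELTA LOCATION '{table_path}'")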

Comments (1)

执着的年纪 2025-01-24 22:21:24

CREATE EXTERNAL TABLE IF NOT EXISTS my_table (name STRING, age INT)
  COMMENT 'This table is created with existing data'
  LOCATION 'spark-warehouse/tables/my_existing_table'

Use the method above to create an external table in Delta Lake in the same manner as Hive.

For a complete reference, check the web doc link below:
https://learn.microsoft.com/en-us/azure/databricks/spark/2.x/spark-sql/language-manual/create-table

CREATE TABLE [IF NOT EXISTS] [db_name.]table_name
  [(col_name1 col_type1 [COMMENT col_comment1], ...)]
  USING data_source
  [OPTIONS (key1 [ = ] val1, key2 [ = ] val2, ...)]
  [PARTITIONED BY (col_name1, col_name2, ...)]
  [CLUSTERED BY (col_name3, col_name4, ...) INTO num_buckets BUCKETS]
  [LOCATION path]
  [COMMENT table_comment]
  [TBLPROPERTIES (key1 [ = ] val1, key2 [ = ] val2, ...)]
  [AS select_statement]

The [LOCATION path] clause lets you supply the external path; it is this clause that makes the table an external one.
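
Applied to the Delta table from the question, the same LOCATION-based pattern might look like the sketch below (illustrative only: the abfss path is a placeholder, the mydb database is assumed to exist, and the Delta files are assumed to already have been written to that location). Because the table is unmanaged, dropping it removes only the metastore entry, so re-creating it over the same LOCATION exposes the same data again, mirroring the Hive external-table behaviour described in the question.

# Illustrative sketch: placeholder ADLS path; Delta files are assumed to already exist there.
table_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/stage/test_table"

# Register an unmanaged table over the existing Delta location.
spark.sql(f"CREATE TABLE IF NOT EXISTS mydb.test_table USING DELTA LOCATION '{table_path}'")

# Dropping an unmanaged table removes only the metastore entry; the files stay in ADLS.
spark.sql("DROP TABLE IF EXISTS mydb.test_table")

# Re-creating the table over the same LOCATION brings the same data back.
spark.sql(f"CREATE TABLE IF NOT EXISTS mydb.test_table USING DELTA LOCATION '{table_path}'")
spark.sql("SELECT * FROM mydb.test_table").show()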
