How do I create external Delta tables on Azure Data Lake Storage? I am currently working on a migration project (from PySpark/Hadoop to Azure). I couldn't find much documentation on creating unmanaged tables in Azure Delta Lake. Here is the sequence of operations that I can currently perform in my PySpark/Hive/HDFS setup; I wonder how I can do the same on Azure.
Actions in sequence:
- Create a dataframe DF
- Drop the Hive external table if it exists, then load dataframe DF into this external table using DF.write.insertInto("table")
- Create a dataframe DF1
- Drop the Hive external table if it exists, then load dataframe DF1 into this external table using DF1.write.insertInto("table")
Even though I perform the "drop table if exists" before loading the 2nd dataframe, if I query the table after step 4 I can see content from both dataframes, because dropping a Hive external table only removes the table structure, not the actual data. Here is how it looks:
>>> df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
>>> df1 = spark.createDataFrame([('abcd1','xyz1')], ['s', 'd'])
>>> spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.test_table (s string,d string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs://system/dev/stage/test_table'")
>>> df.write.insertInto("mydb.test_table",overwrite=True)
>>> spark.sql('DROP TABLE IF EXISTS mydb.test_table')
>>> df1.write.insertInto("mydb.test_table",overwrite=False)
>>> spark.sql("select * from mydb.test_table").show()
+-----+----+
| s| d|
+-----+----+
|abcd1|xyz1|
| abcd| xyz|
+-----+----+
I am trying to do the equivalent with an Azure Delta Lake table, using the steps below:
- Create the dataframes.
- Save the dataframes in ADLS (this is probably what I am doing wrong; should I use a mounted dbfs path instead of the container path?)
- Create an unmanaged table on top of this path.
Here is the code in my Databricks notebook.
df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
table_path = f"abfss://[email protected]/stage/test_table"
df.write.format("delta").mode("overwrite").option("path",table_path)
spark.sql("CREATE TABLE test_table USING DELTA LOCATION 'abfss://[email protected]/stage/test_table'")
However, it is not writing the dataframe to the location table_path, and the final step fails to create the table (is a dbfs: mount path required here?). How can I perform similar operations using unmanaged Delta Lake tables?
Comments (1)
Use the above method to create an external table in Delta Lake in the same manner as Hive.
For a complete reference, check the web doc link below:
https://learn.microsoft.com/en-us/azure/databricks/spark/2.x/spark-sql/language-manual/create-table
The [LOCATION path] clause lets you supply the external path; that clause is what makes the statement create an external (unmanaged) table.
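As a rough sketch (using placeholder <container>/<storage-account> values and assuming the cluster already has credentials for the ADLS path), the flow could look like the snippet below. Note that the write in the question never reaches storage because the DataFrameWriter chain has no terminating save(); once the Delta files exist at the path, CREATE TABLE ... USING DELTA LOCATION registers an unmanaged table over them.
# Placeholders only: substitute a real container and storage account.
table_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/stage/test_table"

df = spark.createDataFrame([('abcd', 'xyz')], ['s', 'd'])

# save() is the action that actually writes the Delta files to the external location.
df.write.format("delta").mode("overwrite").save(table_path)

# Register an unmanaged (external) table over that path.
spark.sql(f"CREATE TABLE IF NOT EXISTS test_table USING DELTA LOCATION '{table_path}'")

# Dropping the table removes only the metastore entry, not the files in ADLS,
# mirroring the Hive external-table behaviour described in the question.
spark.sql("DROP TABLE IF EXISTS test_table")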