How do I create external Delta tables on Azure Data Lake Storage? I am currently working on a migration project (from PySpark/Hadoop to Azure). I couldn't find much documentation on creating unmanaged tables in Azure Delta Lake. Here is the sequence of operations that I can currently perform in my PySpark/Hive/HDFS setup; I wonder how I can do the same on Azure.
Actions in sequence:
- Create a dataframe DF
- Drop the Hive external table if it exists, then load dataframe DF into this external table using DF.write.insertInto("table")
- Create a dataframe DF1
- Drop the Hive external table if it exists, then load dataframe DF1 into this external table using DF1.write.insertInto("table")
Even though I perform the "drop table if exists" before loading the 2nd dataframe, if I query the table after step 4 I can see content from both dataframes, because dropping a Hive external table only removes the table structure, not the actual data. Here is how it looks:
>>> df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
>>> df1 = spark.createDataFrame([('abcd1','xyz1')], ['s', 'd'])
>>> spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS mydb.test_table (s string,d string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 'hdfs://system/dev/stage/test_table'")
>>> df.write.insertInto("mydb.test_table",overwrite=True)
>>> spark.sql('DROP TABLE IF EXISTS mydb.test_table')
>>> df1.write.insertInto("mydb.test_table",overwrite=False)
>>> spark.sql("select * from mydb.test_table").show()
+-----+----+
| s| d|
+-----+----+
|abcd1|xyz1|
| abcd| xyz|
+-----+----+
I am trying to do the equivalent with an Azure Delta Lake table, using the steps below:
- Create the dataframes.
- Save the dataframes in ADLS (this is probably what I am doing wrong; should I use a mounted dbfs path instead of the container path?)
- Create an unmanaged table on top of this path.
Here is the code in my Databricks notebook.
df = spark.createDataFrame([('abcd','xyz')], ['s', 'd'])
table_path = f"abfss://[email protected]/stage/test_table"
df.write.format("delta").mode("overwrite").option("path",table_path)
spark.sql("CREATE TABLE test_table USING DELTA LOCATION 'abfss://[email protected]/stage/test_table'")
However, it is not writing the dataframe to the location table_path, and the final step fails to create the table (is a dbfs: mount path required here?). How can I perform similar operations using unmanaged Delta Lake tables?
Comments (1)
Use the above method to create an external table in Delta Lake in the same manner as Hive.
For a complete reference, check the web doc link below:
https://learn.microsoft.com/en-us/azure/databricks/spark/2.x/spark-sql/language-manual/create-table
The [LOCATION path] clause lets you supply the external path; that clause is what makes the statement create an external (unmanaged) table.
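As a rough sketch (using placeholder <container>/<storage-account> values and assuming the cluster already has credentials for the ADLS path), the flow could look like the snippet below. Note that the write in the question never reaches storage because the DataFrameWriter chain has no terminating save(); once the Delta files exist at the path, CREATE TABLE ... USING DELTA LOCATION registers an unmanaged table over them.
# Placeholders only: substitute a real container and storage account.
table_path = "abfss://<container>@<storage-account>.dfs.core.windows.net/stage/test_table"

df = spark.createDataFrame([('abcd', 'xyz')], ['s', 'd'])

# save() is the action that actually writes the Delta files to the external location.
df.write.format("delta").mode("overwrite").save(table_path)

# Register an unmanaged (external) table over that path.
spark.sql(f"CREATE TABLE IF NOT EXISTS test_table USING DELTA LOCATION '{table_path}'")

# Dropping the table removes only the metastore entry, not the files in ADLS,
# mirroring the Hive external-table behaviour described in the question.
spark.sql("DROP TABLE IF EXISTS test_table")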