Insert Spark dataframe into a partitioned table
I have seen methods for inserting into a Hive table, such as insertInto(table_name, overwrite=True), but I couldn't work out how to handle the scenario below.
For the first run, a dataframe like this needs to be saved in a table, partitioned by 'date_key'. There could be one or more partitions, e.g. 202201 and 202203:
+---+----------+
| id| date_key|
+---+----------+
| 1|202201 |
| 2|202203 |
| 3|202201 |
+---+----------+
For subsequent runs, the data also comes in like this, and I'd like to append the new data to their corresponding partitions using date_key:
+---+----------+
| id| date_key|
+---+----------+
| 4|202204 |
| 5|202203 |
| 6|202204 |
+---+----------+
Could you please help to shed some light on how to handle
- if during each run there will only be new data from one partition
- if during each run there will be new data from multiple partitions, like the sample inputs above?
Many thanks for your help. Let me know if I can explain the problem better.
Edited:
I could not use df.write.partitionBy("date_key").insertInto(table_name), as there was an error saying insertInto cannot be used together with partitionBy.
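For reference, a minimal self-contained sketch of the call that triggers this error (some_table is just a placeholder name):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Create a table that is already partitioned by date_key.
df = spark.createDataFrame([(1, "202201"), (2, "202203")], ["id", "date_key"])
df.write.partitionBy("date_key").saveAsTable("some_table")  # placeholder table name

# The failing call: the table already defines date_key as its partition column,
# so combining partitionBy with insertInto raises an AnalysisException along the
# lines of "insertInto() can't be used together with partitionBy()".
df.write.partitionBy("date_key").insertInto("some_table")
```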
2 Answers
In my example here, the first run will create the new partitioned table data, with c2 as the partition column.

For the second run you don't need anything fancy, just append and insertInto. Spark knows that c2 is the partition column and will handle it properly; you don't have to tell it via partitionBy.
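The snippet this answer refers to isn't reproduced above; a minimal sketch of the two runs it describes (the table name data and columns c1/c2 follow the answer, the sample rows are illustrative) could look like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# First run: create the partitioned table 'data' with c2 as the partition column.
df1 = spark.createDataFrame([(1, "202201"), (2, "202203"), (3, "202201")], ["c1", "c2"])
df1.write.partitionBy("c2").saveAsTable("data")

# Second run: plain append + insertInto. The table already defines c2 as its
# partition column, so partitionBy must not be repeated here. Note that
# insertInto matches columns by position, not by name.
df2 = spark.createDataFrame([(4, "202204"), (5, "202203"), (6, "202204")], ["c1", "c2"])
df2.write.mode("append").insertInto("data")
```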
If the table is an external table, you can use the following code to write the data out to the external partitioned table.
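The referenced code block didn't survive in this copy; a rough sketch of that pattern (the location path and table name are placeholders, assuming the external table is defined over that path and partitioned by date_key):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

new_df = spark.createDataFrame([(4, "202204"), (5, "202203"), (6, "202204")], ["id", "date_key"])

# Append new partition directories under the external table's location.
new_df.write.mode("append").partitionBy("date_key").parquet("/path/to/external/table")  # placeholder path

# If the metastore doesn't pick up the new partitions automatically,
# register them explicitly.
spark.sql("MSCK REPAIR TABLE my_external_table")  # placeholder table name
```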
If it is a Hive managed table, then you can simply use the saveAsTable API as follows.
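Again, the original snippet isn't included here; a minimal sketch reusing the question's sample data (my_table is a placeholder name):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# First run: create the managed table, partitioned by date_key.
df = spark.createDataFrame([(1, "202201"), (2, "202203"), (3, "202201")], ["id", "date_key"])
df.write.mode("overwrite").partitionBy("date_key").saveAsTable("my_table")  # placeholder table name

# Subsequent runs: append; rows land in their matching date_key partitions.
# The partitioning specified here must match the existing table's partitioning.
new_df = spark.createDataFrame([(4, "202204"), (5, "202203"), (6, "202204")], ["id", "date_key"])
new_df.write.mode("append").partitionBy("date_key").saveAsTable("my_table")
```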