Change Parquet file column values without changing the Parquet file names, so that _spark_metadata is not altered
So I have a Spark Structured Streaming job that stores DataFrames of flattened JSON messages at an HDFS location in Parquet format, in date-partitioned folders, i.e. /source/wireless/active_portal/activation/dt=<current date>. It is also creating a _spark_metadata directory which holds the names of the files it has written into the output location.
This HDFS location has a Hive table on top of it.
I have to change a particular column value, the PIN, to a default value XXXX in all Parquet files without changing their names, so that _spark_metadata is not disturbed.
Is there a way I can change the data in a column at the HDFS location, either by using Spark commands or by using Hive commands?
Schema of the Parquet files:
root
|-- accNumber: string (nullable = true)
|-- accountPin: string (nullable = false)
|-- firstName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- dt: date (nullable = false)
|-- load_time: timestamp (nullable = false)
Spark metadata file:
{"path":"hdfs://HDPDEVNN/source/wireless/activation_portal/activation/dt=2022-03-30/part-00000-02b9b4f6-ea66-483e-9348-a9b87f33a232.c000.snappy.parquet","size":6834,"isDir":false,"modificationTime":1648655859206,"blockReplication":2,"blockSize":134217728,"action":"add"}
I tried getting all the data into a Spark DataFrame, changing the column, and rewriting it, but that changes the names of the files and disturbs the _spark_metadata file. Now I can't read the whole folder using Spark, since it reports that xyz.parquet does not exist, because the file names have changed.
I also tried an INSERT OVERWRITE ... DIRECTORY statement in Hive, but that also creates files with different names.
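Roughly what I tried in Hive (the table name activation and the partition value are placeholders here):

-- rough sketch of the attempted overwrite; table name is a placeholder
INSERT OVERWRITE DIRECTORY '/source/wireless/activation_portal/activation/dt=2022-03-30'
STORED AS PARQUET
SELECT
    accNumber,
    'XXXX' AS accountPin,  -- mask the PIN with the default value
    firstName,
    lastName,
    load_time
FROM activation
WHERE dt = '2022-03-30';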
Comments (1)
You can't modify a file on HDFS, so you have to use Hive. Assuming your table isn't a transactional table, you can't modify the data in place, so you have to wash it through another table.
Here's a dummied up example of how I'd approach this. Writing this from memory, so it may not be entirely correct syntax.
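Something along these lines, assuming the source table is called activation and washing the data through a throwaway staging table (both table names are placeholders):

-- copy everything into a staging table, masking the PIN on the way
-- (table names are placeholders; adjust them to the real ones)
CREATE TABLE activation_staging STORED AS PARQUET AS
SELECT
    accNumber,
    'XXXX' AS accountPin,  -- replace the PIN with the default value
    firstName,
    lastName,
    dt,
    load_time
FROM activation;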
Now that you've got your changed data, you just need to put it back in the original table.
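Roughly like this, again with placeholder table names; dynamic partitioning is needed because dt is the partition column:

-- allow dynamic partition inserts
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- write the masked rows back over the original table's partitions
-- (the partition column dt must come last in the SELECT list)
INSERT OVERWRITE TABLE activation PARTITION (dt)
SELECT
    accNumber,
    accountPin,
    firstName,
    lastName,
    load_time,
    dt
FROM activation_staging;

Once you've verified the data, you can drop the staging table with DROP TABLE activation_staging.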