Change Parquet file column values without changing the Parquet file names, so that _spark_metadata is not altered

Posted on 2025-01-18 04:13:17


So I have a Spark Structured Streaming job that stores flattened JSON message DataFrames in Parquet format in an HDFS location, in date-partitioned folders, i.e. /source/wireless/active_portal/activation/dt=current date. It also creates a _spark_metadata directory that records the names of the files it has written into the output location.
This HDFS location has a Hive table on top of it.
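
A minimal sketch of the kind of writer that produces this layout is shown below; the input source, the schema wiring, and the checkpoint path are assumptions for illustration, not details from the original job:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("activation-writer").getOrCreate()

# Column layout taken from the schema shown below (dt is derived here).
schema = (
    StructType()
    .add("accNumber", StringType())
    .add("accountPin", StringType())
    .add("firstName", StringType())
    .add("lastName", StringType())
    .add("load_time", TimestampType())
)

# Stream flattened JSON messages from a landing directory (assumed source).
flattened = (
    spark.readStream.schema(schema)
    .json("/landing/activation")  # hypothetical input path
    .withColumn("dt", F.to_date("load_time"))
)

# A partitioned Parquet sink is what creates the _spark_metadata directory
# under the output path; Spark treats it as the authoritative file list.
query = (
    flattened.writeStream.format("parquet")
    .option("path", "/source/wireless/active_portal/activation")
    .option("checkpointLocation", "/checkpoints/activation")  # hypothetical
    .partitionBy("dt")
    .start()
)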

I have to change a particular column value, PIN, to a default value XXXX in all the Parquet files without changing their names, so that the _spark_metadata is not disturbed.

Is there a way I can change the data in a column in the HDFS location, either by using Spark commands or by using Hive commands?

Schema of the Parquet files:
root
 |-- accNumber: string (nullable = true)
 |-- accountPin: string (nullable = false)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- dt: date (nullable = false)
 |-- load_time: timestamp (nullable = false)

Entry from the _spark_metadata file:
{"path":"hdfs://HDPDEVNN/source/wireless/activation_portal/activation/dt=2022-03-30/part-00000-02b9b4f6-ea66-483e-9348-a9b87f33a232.c000.snappy.parquet","size":6834,"isDir":false,"modificationTime":1648655859206,"blockReplication":2,"blockSize":134217728,"action":"add"}

I tried getting all the data in a Spark DataFrame, changing the column, and rewriting, but that changes the names of the files and disturbs the _spark_metadata file. Now I can't read the whole folder using Spark, since it gives me "xyz.parquet file does not exist" because the file names have been changed.
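
For reference, the batch rewrite described above presumably looked something like the sketch below (the paths and the staging location are assumptions, not taken from the original job):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pin-mask").getOrCreate()

# Read every partition and replace the PIN column with the default value.
df = spark.read.parquet("/source/wireless/active_portal/activation")
masked = df.withColumn("accountPin", F.lit("XXXX"))

# Spark cannot safely overwrite a path it is still reading from, so the
# masked data is written to a staging location first (hypothetical path).
masked.write.mode("overwrite").partitionBy("dt").parquet("/tmp/activation_masked")

# Even after moving these files back into place, they carry freshly
# generated part-<uuid> names, so the file list recorded in _spark_metadata
# no longer matches the directory and reads fail with "file does not exist".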

I also tried an INSERT OVERWRITE DIRECTORY statement in Hive, but that also creates files with different names.

Comments (1)

江湖彼岸 2025-01-25 04:13:17


You can't modify a file on HDFS, so you have to use Hive. Assuming your table isn't a transactional table, you can't modify the data in place, so you have to wash it through another table.

Here's a dummied up example of how I'd approach this. Writing this from memory, so it may not be entirely correct syntax.

-- allow the partition values to be taken from the select list
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create table temptable like originaltable;

-- with a dynamic partition insert, the partition column (dt) must come last
insert overwrite table temptable partition (dt)
select
  col1,
  col2,
  'XXX' as col3,
  dt
from originaltable;

Now that you've got your changed data, you just need to put it back into the original table.

insert overwrite table originaltable partition (dt)
select *
from temptable;
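
After the second insert, a quick way to confirm the mask took effect might be a query like the one below (a hypothetical check, using the accountPin column from the question's schema rather than the dummy column names above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Count rows per PIN value in one partition; after the overwrite, every
# row should carry the masked default value.
spark.sql("""
    SELECT accountPin, COUNT(*) AS row_count
    FROM originaltable
    WHERE dt = DATE '2022-03-30'
    GROUP BY accountPin
""").show()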