Change Parquet file column values without changing the Parquet file names, so that _spark_metadata is not altered
So I have a Spark Structured Streaming job that stores DataFrames of flattened JSON messages at an HDFS location in Parquet format, in date-partitioned folders, i.e. /source/wireless/active_portal/activation/dt=<current date>. It is also creating a _spark_metadata directory which holds the names of the files it has written into the output location.
This HDFS location has a Hive table on top of it.
I have to change a particular column value, the PIN, to a default value XXXX in all Parquet files without changing their names, so that _spark_metadata is not disturbed.
Is there a way I can change the data in a column at the HDFS location, either by using Spark commands or by using Hive commands?
Schema of the Parquet files:
root
|-- accNumber: string (nullable = true)
|-- accountPin: string (nullable = false)
|-- firstName: string (nullable = true)
|-- lastName: string (nullable = true)
|-- dt: date (nullable = false)
|-- load_time: timestamp (nullable = false)
Spark metadata file:
{"path":"hdfs://HDPDEVNN/source/wireless/activation_portal/activation/dt=2022-03-30/part-00000-02b9b4f6-ea66-483e-9348-a9b87f33a232.c000.snappy.parquet","size":6834,"isDir":false,"modificationTime":1648655859206,"blockReplication":2,"blockSize":134217728,"action":"add"}
I tried getting all the data into a Spark DataFrame, changing the column, and rewriting it, but that changes the names of the files and disturbs the _spark_metadata file. Now I can't read the whole folder using Spark, since it reports that xyz.parquet does not exist, because the file names have changed.
I also tried an INSERT OVERWRITE ... DIRECTORY statement in Hive, but that also creates files with different names.
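Roughly what I tried in Hive (the table name activation and the partition value are placeholders here):

-- rough sketch of the attempted overwrite; table name is a placeholder
INSERT OVERWRITE DIRECTORY '/source/wireless/activation_portal/activation/dt=2022-03-30'
STORED AS PARQUET
SELECT
    accNumber,
    'XXXX' AS accountPin,  -- mask the PIN with the default value
    firstName,
    lastName,
    load_time
FROM activation
WHERE dt = '2022-03-30';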
Comments (1)
You can't modify a file on HDFS, so you have to use Hive. Assuming your table isn't a transactional table, you can't modify the data in place, so you have to wash it through another table.
Here's a dummied up example of how I'd approach this. Writing this from memory, so it may not be entirely correct syntax.
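Something along these lines, assuming the source table is called activation and washing the data through a throwaway staging table (both table names are placeholders):

-- copy everything into a staging table, masking the PIN on the way
-- (table names are placeholders; adjust them to the real ones)
CREATE TABLE activation_staging STORED AS PARQUET AS
SELECT
    accNumber,
    'XXXX' AS accountPin,  -- replace the PIN with the default value
    firstName,
    lastName,
    dt,
    load_time
FROM activation;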
Now that you've got your changed data, you just need to put it back in the original table.
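Roughly like this, again with placeholder table names; dynamic partitioning is needed because dt is the partition column:

-- allow dynamic partition inserts
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- write the masked rows back over the original table's partitions
-- (the partition column dt must come last in the SELECT list)
INSERT OVERWRITE TABLE activation PARTITION (dt)
SELECT
    accNumber,
    accountPin,
    firstName,
    lastName,
    load_time,
    dt
FROM activation_staging;

Once you've verified the data, you can drop the staging table with DROP TABLE activation_staging.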