Loading a Hive table from an HDFS location and handling duplicate files

Posted 2025-01-22 02:40:20

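We have a scenario in which daily files are loaded into a specific path in HDFS. On top of that path we created a Hive external table, so the data becomes available as a table in Hive. In the worst case, the same file is pushed to that HDFS path twice, or a copy of the file lands there.

How can we load the second file without running a delete or some other job? What is the best practice for handling this scenario?

Please clarify.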

绝影如岚 2025-01-29 02:40:20

HDFS cannot hold two files with the same name in the same directory, so a true duplicate by filename is not possible. If you are worried about two differently named files with the same or overlapping content, load both as they are to avoid missing data, and maintain a managed table that handles the duplicates.
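One way to realize the managed-table idea is sketched below, under several assumptions: `raw_events` stands for the external table over the HDFS path, `events_clean` for the managed table, and `event_id` and `payload` for its columns. All of these names, and the file name `dedup_events.hql`, are hypothetical. The sketch also assumes that a file pushed twice produces byte-identical rows, which is the only kind of duplicate `SELECT DISTINCT` removes.

```shell
# Write a HiveQL statement that rebuilds the managed table from the
# external table, keeping one copy of each identical row.
# raw_events, events_clean, event_id and payload are placeholder names.
cat > dedup_events.hql <<'SQL'
INSERT OVERWRITE TABLE events_clean
SELECT DISTINCT event_id, payload
FROM raw_events;
SQL

# Run the file with whichever Hive client the cluster provides, e.g.:
#   beeline -u "$HIVE_JDBC_URL" -f dedup_events.hql
# (HIVE_JDBC_URL is an assumed environment variable.)
```

Because the statement overwrites the managed table on each run, it can be scheduled after every daily load without first deleting anything from the HDFS path.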

Use case: keep only the latest file

Detect the latest file in the HDFS directory:

hdfs dfs -ls -R /your/hdfs/dir/ | grep '^-' | awk '{print $6" "$7" "$8}' | sort -r | head -1 | cut -d" " -f3

Then copy it to another HDFS directory. Empty that directory first, because it should contain only the latest file.

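# remove the old contents, then recreate the directory so the copy has a target
hdfs dfs -rm -r -f /your/hdfs/latest_dir/
hdfs dfs -mkdir -p /your/hdfs/latest_dir/
# copy the latest file (directories are filtered out; dates sort lexicographically)
hdfs dfs -cp "$(hdfs dfs -ls -R /your/hdfs/dir/ | grep '^-' | awk '{print $6" "$7" "$8}' | sort -r | head -1 | cut -d" " -f3)" /your/hdfs/latest_dir/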
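The two steps can also be wrapped in a small script that refuses to touch the target directory when no file is found. This is only a sketch: the function names `pick_latest` and `refresh_latest_dir` are made up for illustration, and it assumes the usual eight-column `hdfs dfs -ls` output (permissions, replication, owner, group, size, date, time, path).

```shell
# pick_latest: read `hdfs dfs -ls -R` output on stdin and print the path
# of the most recently modified regular file (lines starting with "-").
pick_latest() {
  awk '/^-/ {
         stamp = $6 " " $7              # "YYYY-MM-DD HH:MM" sorts as a string
         if (stamp > best) {
           best = stamp
           path = $8                    # rebuild the path in case it has spaces
           for (i = 9; i <= NF; i++) path = path " " $i
         }
       }
       END { if (path != "") print path }'
}

# refresh_latest_dir SRC_DIR DEST_DIR: replace DEST_DIR's contents with
# the newest file found under SRC_DIR.
refresh_latest_dir() {
  src_dir=$1
  dest_dir=$2
  latest=$(hdfs dfs -ls -R "$src_dir" | pick_latest)
  if [ -z "$latest" ]; then
    echo "no files found under $src_dir" >&2
    return 1
  fi
  hdfs dfs -rm -r -f "$dest_dir" &&
    hdfs dfs -mkdir -p "$dest_dir" &&
    hdfs dfs -cp "$latest" "$dest_dir/"
}
```

It would be called as `refresh_latest_dir /your/hdfs/dir /your/hdfs/latest_dir`. Keeping the selection logic in `pick_latest` means it can be checked against a saved listing without a cluster.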
