How to ignore empty Parquet files when reading with Hive
I am using Hive 3.1.0, and my query reads a batch of Parquet files from a certain path every hour. I have no control over how these files are generated because an external process creates them. In rare cases, one of the Parquet files in the specified path has zero size. I would like Hive to ignore such files, but my Hive queries fail with the following error:
<filename>.parquet is not a Parquet file (too small length: 0)
How do I avoid this? Too many files may land in an hour, so building automation to detect and delete empty files would be overkill. I believe Hive should have a simpler option to make it just ignore such files.
Comments (2)
Try using the $file_size property: if it is greater than 0, process the data load. It would also help if you could share the query to show how you are accessing the data.
I don't know of a Hive property that does this. If possible, you might handle empty files in a separate directory before pushing them to final storage, using:
find ./your-directory -type f -empty -print -delete
or, if that is not possible, delete the empty files in your final storage.
As a sanity check, list the files first before deleting them.
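The list-then-delete approach could be sketched roughly as below. This is only an illustration, assuming the files sit on a local staging directory and that your find supports -empty and -delete (GNU and BSD find do). The function names and the '*.parquet' name filter are my own additions, not anything Hive provides.

```shell
# Hypothetical helpers; names and the '*.parquet' filter are assumptions.

# Dry run: print zero-byte Parquet files under directory $1 without deleting.
list_empty_parquet() {
    find "$1" -type f -name '*.parquet' -empty -print
}

# Print and delete zero-byte Parquet files under directory $1.
# Run list_empty_parquet first and review its output.
delete_empty_parquet() {
    find "$1" -type f -name '*.parquet' -empty -print -delete
}
```

For example, run list_empty_parquet ./your-directory, check the output, and then run delete_empty_parquet ./your-directory just before the hourly Hive query.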