Folders excluded in a Glue crawler throw a HIVE_BAD_DATA error in Athena

Posted on 2025-01-26 10:38:35, 985 characters, 1 view, 0 comments

I'm trying to create a Glue crawler to crawl a specific path pattern. I have the following paths:

bucket/inference/2022/04/28/modelling/metadata.tar.gz
bucket/inference/2022/04/28/prediction/predictions.parquet
bucket/inference/2022/04/28/extract/data.parquet

The same pattern is repeated every day, i.e. we have the above for

bucket/inference/2022/04/29/*
bucket/inference/2022/04/30/*

I only want to crawl what's in the **/prediction folders each day. I've set up a Glue crawler pointing to bucket/inference/, with the following exclude patterns:

**/modelling/**
**/extract/**
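
For reference, the setup described here could be written as a boto3 crawler definition roughly like the sketch below. Only the S3 path and the two exclude patterns come from the question; the crawler name, IAM role ARN, and database name are hypothetical placeholders.

```python
def build_crawler_config(name, role_arn, database):
    """Build keyword arguments for boto3's glue.create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "S3Targets": [
                {
                    # Root path the crawler points at (from the question).
                    "Path": "s3://bucket/inference/",
                    # Exclude patterns (from the question).
                    "Exclusions": ["**/modelling/**", "**/extract/**"],
                }
            ]
        },
    }


config = build_crawler_config(
    name="inference-crawler",  # hypothetical name
    role_arn="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    database="inference_db",  # hypothetical database
)

# With AWS credentials configured, the crawler could then be created with:
#   import boto3
#   boto3.client("glue").create_crawler(**config)
```

The exclusions only affect which objects the crawler inspects when it infers the table schema.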

The logs correctly show that the bucket/inference/2022/04/28/modelling/metadata.tar.gz and bucket/inference/2022/04/28/extract/data.parquet files are being excluded, and the DDL metadata shows that it's picking up the correct number of objects and rows in the data.

However, when I go to SELECT * in Athena, I get the following error:

HIVE_BAD_DATA: Not valid Parquet file: s3://bucket/inference/2022/04/28/modelling/metadata.tar.gz expected magic number: PAR1
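
For context on the message itself: a standard Parquet file begins and ends with the 4-byte magic number PAR1, while a gzip archive begins with the bytes 0x1f 0x8b, so a .tar.gz file read as Parquet fails this check. A minimal sketch of that check, using made-up byte strings rather than real files:

```python
PARQUET_MAGIC = b"PAR1"
GZIP_MAGIC = b"\x1f\x8b"


def looks_like_parquet(data: bytes) -> bool:
    """Return True if the bytes start and end with the Parquet magic number.

    This only checks the magic bytes; it does not validate the footer.
    """
    return (
        len(data) >= 12
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


def looks_like_gzip(data: bytes) -> bool:
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Illustrative stand-ins for file contents (not real Parquet or tar.gz data).
fake_parquet = PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC
fake_targz = GZIP_MAGIC + b"\x08" + b"\x00" * 16

parquet_ok = looks_like_parquet(fake_parquet)
targz_ok = looks_like_parquet(fake_targz)
```

So the error means Athena actually opened metadata.tar.gz as table data; the file itself is not corrupt.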

I've tried every combo of the above exclude patterns, but it always seems to be picking up what's in the modelling folder, despite the logs explicitly excluding it. Am I missing something here?

Many thanks.


Comments (1)

呆头 2025-02-02 10:38:35

This is a known issue with Athena. From AWS troubleshooting documentation:

Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.

Reference: Athena reads files that I excluded from the AWS Glue crawler (AWS)
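
One way to apply that advice is to copy only the prediction files to a separate prefix and point the crawler (and therefore the Athena table) at that prefix instead. The sketch below is only an illustration under assumptions: the destination prefix inference_predictions/ and both function names are hypothetical, and the key layout YYYY/MM/DD/<folder>/<file> is taken from the paths in the question.

```python
from typing import Optional


def relocated_key(
    key: str,
    src_prefix: str = "inference/",
    dst_prefix: str = "inference_predictions/",  # hypothetical destination
    keep_folder: str = "prediction",
) -> Optional[str]:
    """Map a source key to its destination key, or None if it should be skipped.

    Expects keys shaped like <src_prefix>YYYY/MM/DD/<folder>/<file>.
    """
    if not key.startswith(src_prefix):
        return None
    parts = key[len(src_prefix):].split("/")
    if len(parts) < 5 or parts[3] != keep_folder:
        return None
    # Keep the date parts and the file path, drop the folder name.
    return dst_prefix + "/".join(parts[:3] + parts[4:])


def copy_predictions(bucket: str, src_prefix: str = "inference/") -> int:
    """Copy prediction objects to the separate prefix; returns the copy count.

    Requires boto3 and AWS credentials, so it is not called in this sketch.
    """
    import boto3

    s3 = boto3.client("s3")
    copied = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            new_key = relocated_key(obj["Key"], src_prefix=src_prefix)
            if new_key is None:
                continue
            s3.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
            )
            copied += 1
    return copied


example = relocated_key("inference/2022/04/28/prediction/predictions.parquet")
skipped = relocated_key("inference/2022/04/28/modelling/metadata.tar.gz")
```

With the prediction files under their own prefix, the table location no longer contains the modelling or extract objects, so no exclude patterns are needed.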
