I'm trying to create an AWS Glue crawler to crawl a specific path pattern. I have the following paths:
bucket/inference/2022/04/28/modelling/metadata.tar.gz
bucket/inference/2022/04/28/prediction/predictions.parquet
bucket/inference/2022/04/28/extract/data.parquet
The same pattern is repeated every day, i.e. we have the above for
bucket/inference/2022/04/29/*
bucket/inference/2022/04/30/*
I only want to crawl what's in the **/prediction folders each day. I've set up a Glue crawler pointing to bucket/inference/, with the following exclude patterns:
**/modelling/**
**/extract/**
The logs correctly show that bucket/inference/2022/04/28/modelling/metadata.tar.gz and bucket/inference/2022/04/28/extract/data.parquet are being excluded, and the DDL metadata shows that the crawler is picking up the correct number of objects and rows in the data.
However, when I run SELECT * in Athena, I get the following error:
HIVE_BAD_DATA: Not valid Parquet file: s3://bucket/inference/2022/04/28/modelling/metadata.tar.gz expected magic number: PAR1
I've tried every combination of the above exclude patterns, but Athena always seems to pick up what's in the modelling folder, even though the crawler logs show it being excluded. Am I missing something here?
Many thanks.
Comments (1)
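This is a known issue with Athena. According to the AWS troubleshooting documentation, Athena does not recognize the exclude patterns that you specify for an AWS Glue crawler. It reads every object under the table's S3 location, so the excluded files are still scanned at query time. The documented workaround is to place the files that you want to exclude in a different location.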
Reference: Athena reads files that I excluded from the AWS Glue crawler (AWS)
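One way to apply that workaround is to copy only the prediction files to a separate prefix and point the crawler (or the table location) there. The following is a rough sketch, not a tested migration script: the target prefix predictions/ and the helper names relocated_key and copy_predictions are made up for illustration, and the YYYY/MM/DD/prediction/<file> layout is assumed from the paths in the question.

```python
from typing import Optional

SOURCE_PREFIX = "inference/"    # prefix the crawler currently points at (from the question)
TARGET_PREFIX = "predictions/"  # hypothetical new prefix that holds only prediction files


def relocated_key(key: str) -> Optional[str]:
    """Return the new key for a prediction object, or None for any other object."""
    if not key.startswith(SOURCE_PREFIX):
        return None
    parts = key[len(SOURCE_PREFIX):].split("/")
    # Assumed layout under the source prefix: YYYY/MM/DD/prediction/<file>
    if len(parts) != 5 or parts[3] != "prediction":
        return None
    year, month, day, _, filename = parts
    return f"{TARGET_PREFIX}{year}/{month}/{day}/{filename}"


def copy_predictions(s3_client, bucket: str) -> int:
    """Copy every prediction object to TARGET_PREFIX and return how many were copied."""
    copied = 0
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            new_key = relocated_key(obj["Key"])
            if new_key is None:
                continue  # modelling/ and extract/ objects stay where they are
            s3_client.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
            )
            copied += 1
    return copied
```

With boto3 installed and credentials configured, this could be run as copy_predictions(boto3.client("s3"), "bucket"), after which the crawler would target s3://bucket/predictions/ and no exclude patterns would be needed. Writing the prediction files to the separate prefix in the first place would avoid the copy step entirely.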