Excluded folders in Glue crawler throw HIVE_BAD_DATA error in Athena

Posted on 2025-01-26 10:38:35

I'm trying to create a glue crawler to crawl a specific path pattern. I have the following paths:

bucket/inference/2022/04/28/modelling/metadata.tar.gz
bucket/inference/2022/04/28/prediction/predictions.parquet
bucket/inference/2022/04/28/extract/data.parquet

The same pattern is repeated every day, i.e. we have the above for

bucket/inference/2022/04/29/*
bucket/inference/2022/04/30/*

I only want to crawl what's in the **/prediction/ folders each day. I've set up a Glue crawler pointing to bucket/inference/, with the following exclude patterns:

**/modelling/**
**/extract/**
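
To make the intent of these two patterns concrete, here is a rough local approximation in Python. It uses the standard-library `fnmatch` module, which is not Glue's own glob matcher (in `fnmatch`, `*` also crosses `/`), so treat it only as a sketch of which keys are meant to be dropped. The helper name `is_excluded` is hypothetical, and the keys are written relative to the crawler's include path bucket/inference/.

```python
from fnmatch import fnmatchcase

# Exclude patterns as configured on the crawler.
EXCLUDE_PATTERNS = ["**/modelling/**", "**/extract/**"]

# Object keys relative to the include path bucket/inference/.
KEYS = [
    "2022/04/28/modelling/metadata.tar.gz",
    "2022/04/28/prediction/predictions.parquet",
    "2022/04/28/extract/data.parquet",
]


def is_excluded(relative_key, patterns=EXCLUDE_PATTERNS):
    """Approximate check: True if any pattern matches the relative key.

    Assumption: fnmatch is used as a stand-in for Glue's glob rules.
    """
    return any(fnmatchcase(relative_key, pattern) for pattern in patterns)


# Keys that survive the exclusions in this approximation.
kept = [key for key in KEYS if not is_excluded(key)]
```

Under this approximation only the file in the prediction folder is kept, which is the behaviour I'm after.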

The logs correctly show that the bucket/inference/2022/04/28/modelling/metadata.tar.gz and bucket/inference/2022/04/28/extract/data.parquet files are being excluded, and the DDL metadata shows that it's picking up the correct number of objects and rows in the data.

However, when I go to SELECT * in Athena, I get the following error:

HIVE_BAD_DATA: Not valid Parquet file: s3://bucket/inference/2022/04/28/modelling/metadata.tar.gz expected magic number: PAR1
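
The "magic number" in the message is the 4-byte marker PAR1 that a Parquet file carries at both its start and its end. A minimal sketch of that check against a local copy of a file is below; the helper name `looks_like_parquet` is hypothetical, and this is not Athena's own validation code.

```python
import os

PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(path):
    """Return True if the file both starts and ends with the PAR1 marker."""
    # Anything shorter than two markers cannot carry both of them.
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A gzip-compressed tarball such as metadata.tar.gz starts with the gzip bytes 1f 8b rather than PAR1, so it fails this check, which is consistent with the error above.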

I've tried every combo of the above exclude patterns, but it always seems to be picking up what's in the modelling folder, despite the logs explicitly excluding it. Am I missing something here?

Many thanks.
