AWS Glue crawler adds a partition despite an exclude pattern

Posted on 2025-02-09 15:01:01

I ran into the following situation. Say I have the following S3 structure:

  • s3://my_bucket/path_to_crawl/partition=A/some_file.parquet
  • s3://my_bucket/path_to_crawl/partition=B/some_file.parquet
  • s3://my_bucket/path_to_crawl/partition=C/some_file.csv

I point my crawler at s3://my_bucket/path_to_crawl, and it has the following exclude pattern defined: **.csv. The desired output is a table path_to_crawl with 2 partitions, s3://my_bucket/path_to_crawl/partition=A/ and s3://my_bucket/path_to_crawl/partition=B/. However, the crawler also adds the partition s3://my_bucket/path_to_crawl/partition=C/. It does not create a separate table for that last partition, which is what I want, since without the exclude pattern it would create one.

The problem is that, in table path_to_crawl, the partition s3://my_bucket/path_to_crawl/partition=C/ inherits its schema from the table, so when I query it through e.g. Athena, the query errors out because it encounters an incompatible schema.

Is there a way to make exclude patterns not add the partition to the table as well?

Comments (1)

沙沙粒小 2025-02-16 15:01:01

Is there a way to make exclude patterns not add the partition to the table as well?

You'll have to explicitly exclude the path of the partition.

Include path:

s3://my_bucket/path_to_crawl

Exclude paths:

**.csv
partition=C/*
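
For anyone who configures the crawler from code rather than the console, the include and exclude paths above might be expressed roughly as follows. This is only a sketch, assuming boto3 and its Glue `update_crawler` call with an `S3Targets` entry carrying `Path` and `Exclusions`. The helper names `build_s3_target` and `apply_exclusions`, and whatever crawler name you pass in, are hypothetical and not from the question. Exclude patterns are evaluated relative to the include path.

```python
def build_s3_target(include_path, exclude_patterns):
    """Build one S3 target entry with an include path and its exclude globs."""
    return {"Path": include_path, "Exclusions": list(exclude_patterns)}


def apply_exclusions(crawler_name, include_path, exclude_patterns):
    """Send the target to an existing crawler (hypothetical helper, needs AWS credentials)."""
    # Imported here so build_s3_target can be used without boto3 installed.
    import boto3

    glue = boto3.client("glue")
    # Note: the Targets passed here become the crawler's targets,
    # so include every target the crawler should keep.
    glue.update_crawler(
        Name=crawler_name,
        Targets={"S3Targets": [build_s3_target(include_path, exclude_patterns)]},
    )


# Same values as in the answer: exclude CSV files and the partition=C folder.
target = build_s3_target(
    "s3://my_bucket/path_to_crawl",
    ["**.csv", "partition=C/*"],
)
```

Calling `apply_exclusions("my_crawler", ...)` with these values would update the crawler. If partition=C can contain nested folders, a pattern such as partition=C/** may be the safer choice, since in Glue's glob syntax a single * does not cross folder boundaries. A partition that an earlier crawl already added to the catalog may also need to be removed separately.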