AWS Glue crawler adding partition under exclude pattern
I ran into the following situation. Let's say I have the following S3 structure:
- s3://my_bucket/path_to_crawl/partition=A/some_file.parquet
- s3://my_bucket/path_to_crawl/partition=B/some_file.parquet
- s3://my_bucket/path_to_crawl/partition=C/some_file.csv
I point my crawler to s3://my_bucket/path_to_crawl and it has the following exclude pattern defined: **.csv. The desired output would be a table path_to_crawl created with 2 partitions, s3://my_bucket/path_to_crawl/partition=A/ and s3://my_bucket/path_to_crawl/partition=B/. However, the crawler also adds the partition s3://my_bucket/path_to_crawl/partition=C/. It does not create a separate table for that last partition, which is what I want, since without the exclude pattern it would create one.
The problem is that, in table path_to_crawl, the partition s3://my_bucket/path_to_crawl/partition=C/ inherits its schema from the table, so when I query it through e.g. Athena, the query errors out because it encounters an incompatible schema.
Is there a way to make exclude patterns not add the partition to the table as well?
Comments (1)
You'll have to explicitly exclude the path of the partition.
Include path:
Exclude paths:
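The include and exclude values did not survive in the answer above, so here is a rough sketch of what explicitly excluding the partition path might look like with boto3. The crawler name my_crawler, the helper names, and the pattern partition=C/** are my own assumptions for illustration, not the answerer's values. It relies on Glue exclude patterns being interpreted relative to the include path.

```python
def build_s3_target(include_path, excluded_partitions, extra_exclusions=()):
    """Build one S3 target entry for a Glue crawler.

    Each excluded partition folder (given relative to include_path) becomes
    a '<folder>/**' glob, which is meant to exclude everything under it.
    """
    exclusions = [f"{p.strip('/')}/**" for p in excluded_partitions]
    exclusions.extend(extra_exclusions)
    return {"Path": include_path, "Exclusions": exclusions}


def apply_to_crawler(crawler_name, target):
    """Replace the crawler's targets with the single S3 target given.

    Needs boto3 and AWS credentials; imported lazily so the helper above
    can be used without them. Note that this overwrites existing targets.
    """
    import boto3

    glue = boto3.client("glue")
    glue.update_crawler(Name=crawler_name, Targets={"S3Targets": [target]})


# Paths taken from the question; "partition=C" is the folder to exclude.
target = build_s3_target(
    "s3://my_bucket/path_to_crawl",
    excluded_partitions=["partition=C"],
    extra_exclusions=["**.csv"],
)
print(target)

# Hypothetical crawler name, uncomment to apply against a real account:
# apply_to_crawler("my_crawler", target)
```

The same two values can be entered in the console as the include path and exclude patterns of the crawler's S3 data source, which is presumably what the answer was pointing at.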