AWS Glue crawler is too slow
Do Glue crawlers have an option to crawl only certain folders under S3? Currently our pipeline is getting slower and slower since we continuously have new data coming in. We know exactly which folders are new and which pattern they follow.
2 Answers
A Glue Crawler can be configured to crawl only specific paths from an S3 source (include path). Additionally, if needed, a crawler can be configured to exclude certain file patterns (exclude patterns). Both are set in the `CreateCrawler` API request.
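A minimal sketch of such a `CreateCrawler` request using boto3. The bucket, prefix, role, and database names are illustrative placeholders, not values from the original answer:

```python
# Hedged sketch: a CreateCrawler request restricted to one S3 prefix,
# with glob-style exclude patterns for files that should never be crawled.
# All names below are hypothetical placeholders.
crawler_config = {
    "Name": "sales-data-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    "DatabaseName": "sales_db",
    "Targets": {
        "S3Targets": [
            {
                # Include path: only objects under this prefix are crawled
                "Path": "s3://my-bucket/data/sales/2024/",
                # Exclude patterns: glob syntax, relative to the include path
                "Exclusions": ["**/_temporary/**", "**.tmp"],
            }
        ]
    },
}

# With boto3 the request would be sent like this (requires AWS credentials):
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_config)
```

Pointing the include path at only the new folders keeps each run from rescanning the whole bucket.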
A Glue Crawler can also be configured to behave in different ways when new files/folders are added to the include path in an S3 source. Specifically, a crawler can be configured to crawl only new files/folders; this is an incremental crawl. Note: there are restrictions on incremental crawls with respect to schema changes. Take some time to read through the AWS documentation; it's extensive and a bit scattered.
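A sketch of the incremental-crawl variant: the `RecrawlPolicy` set to `CRAWL_NEW_FOLDERS_ONLY` is what makes the crawl incremental, and incremental crawls require the schema-change policy to be set to `LOG` (the restriction mentioned above). Names are again placeholders:

```python
# Hedged sketch: CreateCrawler parameters for an incremental crawl.
# Only new folders under the include path are crawled on each run.
incremental_config = {
    "Name": "sales-data-crawler-incremental",  # hypothetical name
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    "DatabaseName": "sales_db",
    "Targets": {
        "S3Targets": [{"Path": "s3://my-bucket/data/sales/"}]
    },
    # This is what enables the incremental crawl:
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    # Incremental crawls require schema changes to be logged, not applied:
    "SchemaChangePolicy": {
        "UpdateBehavior": "LOG",
        "DeleteBehavior": "LOG",
    },
}

# import boto3
# boto3.client("glue").create_crawler(**incremental_config)
```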
There is a newer method where you use S3 bucket event notifications to track file changes and push them into an SQS queue, which is then specified in your crawler so that it only crawls the files indicated in that queue. This process is documented here:
https://docs.aws.amazon.com/glue/latest/dg/crawler-s3-event-notifications.html
I can confirm that this works as I have implemented this in my own project.
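The event-driven setup described above can be sketched as follows: the SQS queue receiving the S3 event notifications is attached to the S3 target via `EventQueueArn`, and the recrawl behavior is set to `CRAWL_EVENT_MODE`. The queue ARN and other names are placeholders, not from the answer:

```python
# Hedged sketch: CreateCrawler parameters for S3-event-notification mode.
# The crawler reads change events from an SQS queue instead of listing S3.
event_config = {
    "Name": "sales-data-crawler-events",  # hypothetical name
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
    "DatabaseName": "sales_db",
    "Targets": {
        "S3Targets": [
            {
                "Path": "s3://my-bucket/data/sales/",
                # SQS queue wired to the bucket's event notifications:
                "EventQueueArn": "arn:aws:sqs:us-east-1:123456789012:glue-crawler-queue",
            }
        ]
    },
    # Tells the crawler to crawl only objects named in the event queue:
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
}

# import boto3
# boto3.client("glue").create_crawler(**event_config)
```

The crawler's role also needs permission to consume messages from the queue; the linked AWS page walks through the full IAM and bucket-notification setup.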