AWS Glue crawler too slow

Posted 2025-01-12 21:37:37


Do Glue crawlers have the ability to crawl only certain folders under S3? Currently our pipeline is getting slower and slower, since we continuously have new data coming in. We know exactly which folders are new and what pattern they follow.


Comments (2)

零時差 2025-01-19 21:37:37


A Glue Crawler can be configured to only crawl specific paths from an S3 source (Include path). Additionally, if needed, a crawler can be configured to exclude certain file patterns (Exclude patterns).

Example CreateCrawler API request:

{
   ...
   "Targets": { 
      "S3Targets": [ 
         { 
            "ConnectionName": "string",
            "Exclusions": [
               "file_pattern_to_exclude_1",  // <-- Exclude patterns
               "file_pattern_to_exclude_2",
            ],
            "Path": "s3://<bucket>/path/to/include",  // <-- Include path
            "SampleSize": number
         },
         {
            ...
         }
      ]
   },
   ...
}

References

  1. Crawler Properties (AWS)
  2. CreateCrawler Glue Web API (AWS)
  3. S3Target Glue Web API (AWS)
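As a concrete illustration, the same request shape can be built for boto3's `create_crawler` call. The crawler name, role ARN, database name, bucket, and exclusion patterns below are placeholders, not values from the question:

```python
# Sketch: build a CreateCrawler request that limits the crawl to one
# include path and skips scratch files via exclude patterns.
def build_crawler_request(bucket: str, prefix: str) -> dict:
    return {
        "Name": "demo-crawler",  # hypothetical name
        "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
        "DatabaseName": "demo_db",  # placeholder
        "Targets": {
            "S3Targets": [
                {
                    # Include path: only this prefix is crawled
                    "Path": f"s3://{bucket}/{prefix}",
                    # Exclude patterns: glob-style, relative to the path
                    "Exclusions": ["**/_temporary/**", "**/*.tmp"],
                }
            ]
        },
    }

request = build_crawler_request("my-bucket", "path/to/include")
# A real call would be: boto3.client("glue").create_crawler(**request)
print(request["Targets"]["S3Targets"][0]["Path"])
```

Limiting `Path` to the prefixes you know are new keeps the crawler from re-listing the entire bucket on every run.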

A Glue Crawler can be configured to behave in different ways when new files/folders are added to the include path in an S3 source. Specifically, a crawler can be configured to crawl only new files/folders; this is an incremental crawl.

Note: There are restrictions for incremental crawls with respect to schema changes. Take some time to read through the AWS documentation. It's extensive and a bit scattered.

Example CreateCrawler API request:

{
   ...
   "RecrawlPolicy": {
      "RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"
   },
   "SchemaChangePolicy": {
      "UpdateBehavior": "LOG",
      "DeleteBehavior": "LOG",
   }
   ...
}

References

  1. Incremental Crawls in AWS Glue (AWS)
  2. Setting Crawler Configuration Options (AWS)
  3. RecrawlPolicy Glue Web API (AWS)
  4. SchemaChangePolicy Glue Web API (AWS)
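A minimal sketch of the incremental-crawl fragment as it would appear in a boto3 request (the crawler name is a placeholder). Per the AWS documentation, incremental crawls require the schema change policy to log rather than apply changes, which is why both behaviors are set to `"LOG"`:

```python
# Sketch: configuration fragment that turns on incremental crawling.
# With CRAWL_NEW_FOLDERS_ONLY the crawler only visits folders added
# since the last run, and schema changes are logged, not applied.
incremental_config = {
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
}

# Merged into a full CreateCrawler request (other keys omitted here):
request = {"Name": "demo-crawler", **incremental_config}
print(request["RecrawlPolicy"]["RecrawlBehavior"])
```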
浅暮の光 2025-01-19 21:37:37


There is a newer approach: use S3 bucket event notifications to track file changes and push them to an SQS queue, then configure the crawler to crawl only the files referenced in that queue. The process is documented here:

https://docs.aws.amazon.com/glue/latest/dg/crawler-s3-event-notifications.html

I can confirm that this works, as I have implemented it in my own project.
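If I read the linked page correctly, this is wired up through the S3 target's `EventQueueArn` together with a `RecrawlBehavior` of `CRAWL_EVENT_MODE`. The queue ARN, bucket, and path below are placeholders:

```python
# Sketch: crawler fragment that consumes S3 event notifications from SQS.
# The crawler then crawls only the objects referenced by messages in the
# queue, instead of listing the S3 path on every run.
event_driven_config = {
    "Targets": {
        "S3Targets": [
            {
                "Path": "s3://my-bucket/path/to/include",  # placeholder
                "EventQueueArn": "arn:aws:sqs:us-east-1:123456789012:my-crawler-queue",  # placeholder
            }
        ]
    },
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
}
print(event_driven_config["RecrawlPolicy"]["RecrawlBehavior"])
```

The S3 bucket must have an event notification rule that publishes object-created events to the same queue; that part is set up on the bucket, not in the crawler.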
