AWS Glue crawler too slow

Posted 2025-01-12 21:37:37


Do Glue crawlers have the ability to crawl only certain folders under S3? Currently our pipeline is getting slower and slower, since we continuously have new data coming in. We know exactly which folders are new and what pattern they follow.


Comments (2)

零時差 2025-01-19 21:37:37


A Glue Crawler can be configured to only crawl specific paths from an S3 source (Include path). Additionally, if needed, a crawler can be configured to exclude certain file patterns (Exclude patterns).

Example CreateCrawler API request:

{
   ...
   "Targets": { 
      "S3Targets": [ 
         { 
            "ConnectionName": "string",
            "Exclusions": [
               "file_pattern_to_exclude_1",  // <-- Exclude patterns
               "file_pattern_to_exclude_2",
            ],
            "Path": "s3://<bucket>/path/to/include",  // <-- Include path
            "SampleSize": number
         },
         {
            ...
         }
      ]
   },
   ...
}

References

  1. Crawler Properties (AWS)
  2. CreateCrawler Glue Web API (AWS)
  3. S3Target Glue Web API (AWS)
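As a concrete illustration, the same request shape can be built for boto3's `create_crawler` call. The crawler name, role ARN, database name, bucket, and exclusion patterns below are placeholders, not values from the question:

```python
# Sketch: build a CreateCrawler request that limits the crawl to one
# include path and skips scratch files via exclude patterns.
def build_crawler_request(bucket: str, prefix: str) -> dict:
    return {
        "Name": "demo-crawler",  # hypothetical name
        "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder
        "DatabaseName": "demo_db",  # placeholder
        "Targets": {
            "S3Targets": [
                {
                    # Include path: only this prefix is crawled
                    "Path": f"s3://{bucket}/{prefix}",
                    # Exclude patterns: glob-style, relative to the path
                    "Exclusions": ["**/_temporary/**", "**/*.tmp"],
                }
            ]
        },
    }

request = build_crawler_request("my-bucket", "path/to/include")
# A real call would be: boto3.client("glue").create_crawler(**request)
print(request["Targets"]["S3Targets"][0]["Path"])
```

Limiting `Path` to the prefixes you know are new keeps the crawler from re-listing the entire bucket on every run.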

A Glue Crawler can be configured to behave in different ways when new files/folders are added to the include path in an S3 source. Specifically, a crawler can be configured to crawl only new files/folders; this is an incremental crawl.

Note: There are restrictions for incremental crawls with respect to schema changes. Take some time to read through the AWS documentation. It's extensive and a bit scattered.

Example CreateCrawler API request:

{
   ...
   "RecrawlPolicy": {
      "RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"
   },
   "SchemaChangePolicy": {
      "UpdateBehavior": "LOG",
      "DeleteBehavior": "LOG",
   }
   ...
}

References

  1. Incremental Crawls in AWS Glue (AWS)
  2. Setting Crawler Configuration Options (AWS)
  3. RecrawlPolicy Glue Web API (AWS)
  4. SchemaChangePolicy Glue Web API (AWS)
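A minimal sketch of the incremental-crawl fragment as it would appear in a boto3 request (the crawler name is a placeholder). Per the AWS documentation, incremental crawls require the schema change policy to log rather than apply changes, which is why both behaviors are set to `"LOG"`:

```python
# Sketch: configuration fragment that turns on incremental crawling.
# With CRAWL_NEW_FOLDERS_ONLY the crawler only visits folders added
# since the last run, and schema changes are logged, not applied.
incremental_config = {
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
    "SchemaChangePolicy": {"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
}

# Merged into a full CreateCrawler request (other keys omitted here):
request = {"Name": "demo-crawler", **incremental_config}
print(request["RecrawlPolicy"]["RecrawlBehavior"])
```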
浅暮の光 2025-01-19 21:37:37


There is a newer approach: use S3 bucket event notifications to track file changes and push them to an SQS queue, then configure the crawler to crawl only the files referenced in that queue. The process is documented here:

https://docs.aws.amazon.com/glue/latest/dg/crawler-s3-event-notifications.html

I can confirm that this works, as I have implemented it in my own project.
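If I read the linked page correctly, this is wired up through the S3 target's `EventQueueArn` together with a `RecrawlBehavior` of `CRAWL_EVENT_MODE`. The queue ARN, bucket, and path below are placeholders:

```python
# Sketch: crawler fragment that consumes S3 event notifications from SQS.
# The crawler then crawls only the objects referenced by messages in the
# queue, instead of listing the S3 path on every run.
event_driven_config = {
    "Targets": {
        "S3Targets": [
            {
                "Path": "s3://my-bucket/path/to/include",  # placeholder
                "EventQueueArn": "arn:aws:sqs:us-east-1:123456789012:my-crawler-queue",  # placeholder
            }
        ]
    },
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
}
print(event_driven_config["RecrawlPolicy"]["RecrawlBehavior"])
```

The S3 bucket must have an event notification rule that publishes object-created events to the same queue; that part is set up on the bucket, not in the crawler.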
