How can we capture data changes in AWS Glue?

Posted on 2025-01-20 09:11:54


We have source data in an on-premises SQL Server. We are using AWS Glue to fetch data from SQL Server and load it into S3. Could anyone please help with how we can implement change data capture in AWS Glue?

Note: we do not want to use AWS DMS.


Comments (3)

待"谢繁草 2025-01-27 09:11:54

You can leverage AWS DMS for CDC and then use the Apache Iceberg connector with the Glue Data Catalog to achieve this:
https://aws.amazon.com/blogs/big-data/implement-a-cdc-based-upsert-in-a-data-lake-using-apache-iceberg-and-aws-glue/

榆西 2025-01-27 09:11:54


I'm only aware of Glue job bookmarks. They will help you with new records (inserts), but won't help you with the updates and deletes that you typically get with a true CDC solution.

Not sure of your use case, but you could check out the following project. It has a pretty efficient diff feature and, with the right options, can give you CDC-like output:

https://github.com/G-Research/spark-extension/blob/master/DIFF.md
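The snapshot-diff idea behind that project can be sketched in plain Python (the key and change codes here are illustrative, not the library's actual API): compare the previous extract with the current one by primary key and classify each row as an insert, delete, change, or no-op.

```python
def snapshot_diff(old_rows, new_rows, key="id"):
    """Classify rows between two snapshots keyed by a primary key.

    Returns a list of (change, row) tuples where change is one of
    'I' (insert), 'D' (delete), 'C' (change), 'N' (no change) --
    the same flags the spark-extension diff uses.
    """
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    out = []
    for k in old.keys() | new.keys():
        if k not in old:
            out.append(("I", new[k]))          # appeared since last run
        elif k not in new:
            out.append(("D", old[k]))          # vanished -> a delete
        elif old[k] != new[k]:
            out.append(("C", new[k]))          # same key, new values
        else:
            out.append(("N", new[k]))          # untouched
    return out
```

Note that this approach requires keeping (or re-reading) the previous full snapshot, which is the cost you pay for recovering deletes without a log-based CDC tool.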

假装爱人 2025-01-27 09:11:54


It's not possible to implement change data capture through direct Glue data extraction. While a job bookmark can help you identify inserts and updates (if your table contains an update_at timestamp column), it won't cover deletes. You actually need a CDC solution.
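The bookmark-style incremental pull described above boils down to a high-watermark query. A minimal sketch, assuming a hypothetical table and timestamp column name (and noting that a real job should use parameterized queries, not string interpolation):

```python
def incremental_query(table, watermark_col, last_watermark):
    """Build a bookmark-style incremental extraction query.

    Pulls only rows modified since the last successful run. Deleted
    rows simply stop appearing in the source table, so this pattern
    can never observe deletes -- hence the need for a real CDC tool.
    """
    return (
        f"SELECT * FROM {table} "
        f"WHERE {watermark_col} > '{last_watermark}' "
        f"ORDER BY {watermark_col}"
    )
```

After each run you would persist the maximum watermark value seen and feed it into the next run.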

While an AWS Glue direct connection to a database source is a great solution, I strongly discourage using it for incremental data extraction because of the cost implications. It's like using a truck to ship one bottle of drinking water.

As you already commented, I am also not a fan of AWS DMS, but for a robust CDC solution, a tool like Debezium could be a perfect fit. It integrates with Kafka and Kinesis, and you can easily sink the stream directly to S3. Debezium gives you the possibility to capture deletes and append a special boolean __delete column to your data, so your Glue ETL can use this field to manage the removal of those deleted records.
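The consuming side of that pattern can be sketched as follows (a simplified illustration, assuming flattened CDC events with an `id` key and a `__delete` flag as described above; a real Glue job would do this with DataFrame upserts against a table format like Iceberg or Hudi):

```python
def apply_cdc(target, events, key="id", delete_flag="__delete"):
    """Apply Debezium-style CDC events to a keyed target (dict of rows).

    Events whose delete flag is 'true' remove the row from the target;
    every other event is treated as an upsert.
    """
    for ev in events:
        if str(ev.get(delete_flag, "false")).lower() == "true":
            target.pop(ev[key], None)          # honor the delete marker
        else:
            # strip the flag before storing, then insert or overwrite
            row = {k: v for k, v in ev.items() if k != delete_flag}
            target[row[key]] = row
    return target
```

Events must be applied in source order per key, which is why partitioning the Kafka/Kinesis stream by primary key matters for this to stay correct.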
