SSIS data import with resume

Posted on 2024-12-26 01:10:02

I need to push a large SQL table from my local instance to SQL Azure. The transfer is a simple, 'clean' upload - simply push the data into a new, empty table.

The table is extremely large (~100 million rows) and consists only of GUIDs and other simple types (no timestamps or anything).

I created an SSIS package using the Data Import / Export Wizard in SSMS. The package works great.

The problem is when the package is run over a slow or intermittent connection. If the internet connection goes down halfway through, then there is no way to 'resume' the transfer.

What is the best approach to engineering an SSIS package to upload this data in a resumable fashion, i.e. so that it can recover from a connection failure, or so that the job can be run only within specific time windows?

Comments (2)

攀登最高峰 2025-01-02 01:10:02

Normally, in a situation like that, I'd design the package to enumerate through batches of size N (1k rows, 10M rows, whatever) and log to a processing table which batch was the last one successfully transmitted. However, with GUIDs you can't quite partition them out into buckets.
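As a rough illustration of that batch-logging idea (the table and column names below are made up for the sketch, not something from the answer), the processing table might look like this:

```sql
-- Hypothetical checkpoint table for batch-based transfers (illustrative only).
-- Each row records a batch that was fully committed on the destination side.
CREATE TABLE dbo.TransferLog
(
    BatchId         INT          NOT NULL PRIMARY KEY,  -- sequential batch number
    RowsTransferred BIGINT       NOT NULL,              -- rows sent in this batch
    CompletedAtUtc  DATETIME2(0) NOT NULL DEFAULT SYSUTCDATETIME()
);

-- On restart, the package reads the last completed batch and resumes from BatchId + 1.
SELECT MAX(BatchId) AS LastCompletedBatch
FROM dbo.TransferLog;
```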

In this particular case, I would modify your data flow to look like Source -> Lookup -> Destination. In your lookup transformation, query the Azure side and only retrieve the keys (SELECT myGuid FROM myTable). Here, we're only interested in rows that don't have a match in the lookup recordset, as those are the ones pending transmission.
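If it helps to see the same filtering expressed in plain T-SQL: the Lookup's "no match" output is effectively an anti-join. This is only a sketch; myTable and myGuid are the placeholder names from the answer, and LoadedKeys stands in for the key list pulled from the Azure side (in the package that list lives in the Lookup's cache, not in a table):

```sql
-- Rows on the local side whose key does not yet exist on the Azure side.
-- This is the same set the Lookup's "no match" output would carry.
SELECT src.*
FROM dbo.myTable AS src
WHERE NOT EXISTS (
    SELECT 1
    FROM dbo.LoadedKeys AS dest   -- keys already present on SQL Azure
    WHERE dest.myGuid = src.myGuid
);
```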

A full cache is going to cost about 1.5 GB of memory (100M rows * 16 bytes), assuming the Azure side was fully populated, plus the associated data transfer costs. That cost will still be less than truncating and re-transferring all the data, but I want to make sure I call it out.
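For reference, the arithmetic behind that estimate: 100,000,000 uniqueidentifier keys × 16 bytes each = 1.6 × 10^9 bytes, which is roughly 1.5 GiB before any per-row overhead in the Lookup cache.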

谁人与我共长歌 2025-01-02 01:10:02

Just order by your GUID when uploading. And make sure you use the max(guid) from Azure as your starting point when recovering from a failure or restart.
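A minimal sketch of what that restart logic might look like in T-SQL, reusing the placeholder names from the first answer (myTable, myGuid); the exact source query in the package is an assumption, not something this answer spells out:

```sql
-- Step 1: ask the destination (SQL Azure) how far the previous run got.
DECLARE @LastLoadedGuid UNIQUEIDENTIFIER;

SELECT @LastLoadedGuid = MAX(myGuid)
FROM dbo.myTable;                    -- run against the Azure destination

-- Step 2: on the local source, feed the data flow only the rows that sort
-- after that key, in the same order they are inserted on the destination.
SELECT *
FROM dbo.myTable
WHERE @LastLoadedGuid IS NULL        -- first run: nothing loaded yet
   OR myGuid > @LastLoadedGuid
ORDER BY myGuid;
```

In a real package the two steps run against different connections; they are shown as one script here for readability. Note that SQL Server orders uniqueidentifier values by its own byte-group rules rather than by their string form, but since both the MAX() and the comparison are evaluated by the database engines, the source and destination agree on the ordering.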
