SSIS - Bulk updates at the database field level

Posted 2024-08-25 16:19:56 · 849 characters · 8 views · 0 comments

Here's our mission:

  • Receive files from clients. Each file contains anywhere from 1 to 1,000,000 records.
  • Records are loaded to a staging area and business-rule validation is applied.
  • Valid records are then pumped into an OLTP database in a batch fashion, with the following rules:
    • If a record does not exist (we have a key, so this isn't an issue), create it.
    • If the record exists, optionally update each database field. The decision is made based on one of 3 factors... I don't believe it's important what those factors are.

Our main problem is finding an efficient method of optionally updating the data at a field level. This is applicable across ~12 different database tables, with anywhere from 10 to 150 fields in each table (original DB design leaves much to be desired, but it is what it is).

Our first attempt has been to introduce a table that mirrors the staging environment (1 field in staging for each system field) and contains a masking flag. The value of the masking flag represents the 3 factors.

We've then written an UPDATE similar to...

UPDATE OLTPTable1 SET Field1 = CASE 
  WHEN Mask.Field1 = 0 THEN Staging.Field1
  WHEN Mask.Field1 = 1 THEN COALESCE( Staging.Field1 , OLTPTable1.Field1 )
  WHEN Mask.Field1 = 2 THEN COALESCE( OLTPTable1.Field1 , Staging.Field1 )
...

As you can imagine, the performance is rather horrendous.
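
For context, the full statement is shaped roughly like this for one table (RecordKey below stands in for our actual key columns, and Staging / Mask stand in for the real staging and mask table names):

UPDATE OLTPTable1 SET Field1 = CASE
  WHEN Mask.Field1 = 0 THEN Staging.Field1
  WHEN Mask.Field1 = 1 THEN COALESCE( Staging.Field1 , OLTPTable1.Field1 )
  WHEN Mask.Field1 = 2 THEN COALESCE( OLTPTable1.Field1 , Staging.Field1 )
  END
  -- , Field2 = CASE ... END   (and so on for every remaining field)
FROM OLTPTable1
JOIN Staging ON Staging.RecordKey = OLTPTable1.RecordKey
JOIN Mask    ON Mask.RecordKey    = Staging.RecordKey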

Has anyone tackled a similar requirement?

We're an MS shop using a Windows Service to launch the SSIS packages that handle the data processing. Unfortunately, we're pretty much novices at this stuff.

Comments (3)

零度℉ 2024-09-01 16:19:56

If you are using SQL Server 2008, look into the MERGE statement; it may be suitable for your upsert needs here.

Can you use a Conditional Split on the input to send rows to different processing stages depending on which factor is matched? It sounds like you may need to do this for each of the 12 tables, but you could potentially run some of them in parallel.
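
As a rough sketch only (reusing the names from the question, with RecordKey standing in for the real key and only Field1 shown), a mask-driven MERGE for one table might look like:

MERGE OLTPTable1 AS tgt
USING ( SELECT Staging.RecordKey,
               Staging.Field1,
               Mask.Field1 AS Field1Mask
        FROM Staging
        JOIN Mask ON Mask.RecordKey = Staging.RecordKey ) AS src
    ON tgt.RecordKey = src.RecordKey
WHEN NOT MATCHED BY TARGET THEN
    INSERT ( RecordKey, Field1 )
    VALUES ( src.RecordKey, src.Field1 )
WHEN MATCHED THEN
    UPDATE SET Field1 = CASE
        WHEN src.Field1Mask = 0 THEN src.Field1
        WHEN src.Field1Mask = 1 THEN COALESCE( src.Field1 , tgt.Field1 )
        WHEN src.Field1Mask = 2 THEN COALESCE( tgt.Field1 , src.Field1 )
    END;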

差↓一点笑了 2024-09-01 16:19:56

I took a look at the merge tool, but I'm not sure it would allow the flexibility to indicate which data source takes precedence based on a predefined set of rules.

That capability is critical for a system that lets multiple members, who can have very different needs, use the same process.

From what I have read, the Merge function is more of a sorted union.

眼藏柔 2024-09-01 16:19:56

We do use an approach similar to what you describe in our product for external system inputs (we handle a couple of hundred target tables with up to 240 columns). Like you describe, there's anywhere from 1 to a million or more rows.

Generally, we don't try to set up a single mass update; we try to handle one column's values at a time. Given that they're all a single type representing the same data element, the staging UPDATE statements are simple. We generally create scratch tables for mapping values, and it's a simple

UPDATE target SET target.column = mapping.resultcolumn WHERE target.sourcecolumn = mapping.sourcecolumn.

Setting up the mappings is a little involved, but we again deal with one column at a time while doing that.

I don't know how you define 'horrendous'. For us, this process is done in batch mode, generally overnight, so absolute performance is almost never an issue.

EDIT:
We also do these in configurable-size batches, so the working sets & COMMITs are never huge. Our default is 1,000 rows in a batch, but some specific situations have benefited from batches of up to 40,000 rows. We also add indexes to the working data for specific tables.
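
A rough illustration of that pattern (Target1, ScratchMap and the column names here are invented for the example, not our real schema):

DECLARE @BatchSize int = 1000;   -- our default; some loads do better with larger batches
DECLARE @Rows int = 1;

WHILE @Rows > 0
BEGIN
    -- Map one column's values in batches from the scratch table.
    UPDATE TOP (@BatchSize) t
    SET    t.Code = m.ResultCode
    FROM   Target1    AS t
    JOIN   ScratchMap AS m ON m.SourceCode = t.Code
    WHERE  t.Code <> m.ResultCode;   -- skip already-converted rows so the loop terminates

    SET @Rows = @@ROWCOUNT;          -- each batch is its own statement, so commits stay small
END;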
