使用 SSIS 包导入大型数据集时如何执行查找？

发布于 2024-11-16 00:55:19 字数 251 浏览 0 评论 0原文

我有一个问题 - 我希望通过 SSIS (Microsoft ETL) 下载客户数据。对于我导入的每条客户记录，我必须检查目标数据库中的现有客户记录。如果记录存在，我必须更新记录，如果不存在，我必须插入它。获得最佳性能的策略是什么？我们谈论的是 20-3000 万行。

我想到了触发器（INSERT 上触发 - 检查记录是否存在）、查找、游标，但我确信有更好的方法，可能使用 SSIS 中的数据流转换之一？

或者我将其导入临时表然后在本地进行查找是否更好？

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

韵柒 2024-11-23 00:55:19

这是一种可能的选项，您可以应用它来将数据加载到目标表中，无论是否涉及插入/更新。下面给出的示例在 3 分钟内将包含 100 万行的文本文件加载到 SQL 表中。如果该记录在 SQL 表中不存在，那么它将被插入到目标表中，否则该记录将被插入到临时表中，然后该记录将用于在存储过程的帮助下更新目标表。该存储过程还将删除 SQL 表中可能存在但平面文件中不存在的任何记录。该示例使用 SSIS 2008 R2 和 SQL Server 2008 R2 后端。

分步过程：

在 SQL Server 数据库中，创建两个表，即 dbo.ItemInfo 和 dbo.Staging。创建表查询可在脚本部分下找到。这些表的结构如屏幕截图#1所示。 ItemInfo 将保存实际数据，Staging 表将保存暂存数据以比较和更新实际记录。这两个表中的 Id 列都是自动生成的唯一标识列。表 ItemInfo 中的 IsProcessed 列将用于识别和删除不再有效的记录。
创建一个 SSIS 包并创建 5 个变量，如屏幕截图 #2 所示。我对制表符分隔文件使用了 .txt 扩展名，因此变量 FileExtension 中的值是 *.txt。 FilePath 变量将在运行时被赋值。 FolderLocation 变量表示文件所在的位置。 SQLPostLoad 和 SQLPreLoad 变量表示预加载和后加载操作期间使用的存储过程。脚本部分下提供了这些存储过程的脚本。
创建指向 SQL Server 数据库的 OLE DB 连接。创建平面文件连接，如屏幕截图 #3 和 #4 所示。 平面文件连接列部分包含列级别信息。屏幕截图#5 显示列数据预览。
配置控制流任务，如屏幕截图 #6 所示。配置任务预加载、后加载和循环文件，如屏幕截图所示#7 - # 10。预加载将截断临时表并将 ItemInfo 表中所有行的 IsProcessed 标志设置为 false。加载后将更新更改并删除数据库中文件中未找到的行。请参阅这些任务中使用的存储过程，了解这些执行 SQL 任务中正在执行的操作。
双击“加载项目”数据流任务并对其进行配置，如屏幕截图 #11 所示。 Read File 是配置为使用平面文件连接的平面文件源。 Row Count 是派生列转换，其配置如 screenshto #12 所示。 Check Exist 是一个查找转换，其配置如屏幕截图 #13 - #15 所示。 查找不匹配输出被重定向到左侧的目标拆分。 查找匹配输出被重定向到左侧的暂存分割。 Destination Split 和 Staging Split 具有与屏幕截图 #16 中所示完全相同的配置。目标和临时表有 9 个不同目标的原因是为了提高包的性能。
所有目标任务 0 - 8 都配置为将数据插入表 dbo.ItemInfo，如屏幕截图 #17 所示。所有暂存任务 0 - 8 均配置为将数据插入到 dbo.Staging 中，如屏幕截图 #18 所示。
在平面文件连接管理器上，将 ConnectionString 属性设置为使用变量 FilePath，如屏幕截图 #19 所示。这将使包能够在循环遍历文件夹中的每个文件时使用变量中设置的值。

测试场景：

Test results may vary from machine to machine. 
In this scenario, file was located locally on the machine. 
Files on network might perform slower. 
This is provided just to give you an idea. 
So, please take these results with grain of salt.

程序包在具有 Xeon 单核 CPU 2.5GHz 和 3.00 GB RAM 的 64 位计算机上执行。
加载了一个包含100万行的平面文件。包执行时间约为2 分 47 秒。请参阅屏幕截图 #20 和 #21。
使用测试查询部分下提供的查询来修改数据，以在包的第二次运行期间模拟更新、删除和创建新记录。
在数据库中执行以下查询后，加载包含100万行的同一文件。包执行时间约为1 分 35 秒。请参阅屏幕截图 #22 和 #23。请注意屏幕截图中重定向到目标和暂存表的行数 #22。

上述答案是提供给 这个堆栈溢出问题也是如此。

希望有帮助。

测试查询：
。

--These records will be deleted during next run 
--because item ids won't match with file data.
--(111111 row(s) affected)
UPDATE dbo.ItemInfo SET ItemId = 'DEL_' + ItemId WHERE Id % 9 IN (3)

--These records will be modified to their original item type of 'General'
--because that is the data present in the file.
--(222222 row(s) affected)
UPDATE dbo.ItemInfo SET ItemType = 'Testing' + ItemId WHERE Id % 9 IN (2,6)

--These records will be reloaded into the table from the file.
--(111111 row(s) affected)
DELETE FROM dbo.ItemInfo WHERE Id % 9 IN (5,9)

平面文件连接列
。

Name        InputColumnWidth     DataType          OutputColumnWidth
----------  ----------------     ---------------   -----------------
Id          8                    string [DT_STR]   8
ItemId      11                   string [DT_STR]   11
ItemName    21                   string [DT_STR]   21
ItemType    9                    string [DT_STR]   9

脚本： （创建表和存储过程）
。

CREATE TABLE [dbo].[ItemInfo](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemId] [varchar](255) NOT NULL,
    [ItemName] [varchar](255) NOT NULL,
    [ItemType] [varchar](255) NOT NULL,
    [IsProcessed] [bit] NULL,
    CONSTRAINT [PK_ItemInfo] PRIMARY KEY CLUSTERED ([Id] ASC),
    CONSTRAINT [UK_ItemInfo_ItemId] UNIQUE NONCLUSTERED ([ItemId] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Staging](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemId] [varchar](255) NOT NULL,
    [ItemName] [varchar](255) NOT NULL,
    [ItemType] [varchar](255) NOT NULL,
 CONSTRAINT [PK_Staging] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO

CREATE PROCEDURE [dbo].[PostLoad]
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE      ITM
    SET         ITM.ItemName    = STG.ItemName
            ,   ITM.ItemType    = STG.ItemType 
            ,   ITM.IsProcessed = 1
    FROM        dbo.ItemInfo    ITM
    INNER JOIN  dbo.Staging     STG
    ON          ITM.ItemId      = STG.ItemId;

    DELETE FROM dbo.ItemInfo
    WHERE       IsProcessed = 0;
END
GO

CREATE PROCEDURE [dbo].[PreLoad]
AS
BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE dbo.Staging;     

    UPDATE  dbo.ItemInfo 
    SET     IsProcessed = 0;
END
GO

屏幕截图 #1：

屏幕截图 #2：

屏幕截图 #3：

屏幕截图 #4：

屏幕截图 #5：

屏幕截图 #6：

屏幕截图 #7：

屏幕截图 #8：

屏幕截图#9：

屏幕截图 #10:

屏幕截图 #11:< /strong>

屏幕截图 #12：

屏幕截图 #13：

屏幕截图 #14：

屏幕截图#15:

屏幕截图 #16:

屏幕截图 #17:

“17”

屏幕截图 #18：

屏幕截图 #19：

屏幕截图 #20：

屏幕截图 #21：

屏幕截图 #22：

屏幕截图＃23：

Here is one possible option that you can apply to load data into destination table whether it involves insert/update. The example given below loads a text file containing 1 million rows into an SQL table in under 3 minutes. If the record doesn't exist in the SQL table then it will be inserted into destination table else the record will be inserted into a staging table which will then be used to update the destination table with the help of a stored procedure. The stored procedure will also delete any records that might be present in SQL table but not present in the flat file. The example uses SSIS 2008 R2 with SQL Server 2008 R2 backend.

Step-by-step process:

In the SQL Server database, create two tables namely dbo.ItemInfo and dbo.Staging. Create table queries are available under Scripts section. Structure of these tables are shown in screenshot #1. ItemInfo will hold the actual data and Staging table will hold the staging data to compare and update the actual records. Id column in both these tables is an auto-generated unique identity column. IsProcessed column in the table ItemInfo will be used to identify and delete the records that are no longer valid.
Create an SSIS package and create 5 variables as shown in screenshot #2. I have used .txt extension for the tab delimited files and hence the value *.txt in the variable FileExtension. FilePath variable will be assigned with value during run-time. FolderLocation variable denotes where the files will be located. SQLPostLoad and SQLPreLoad variables denote the stored procedures used during the pre-load and post-load operations. Scripts for these stored procedures are provided under the Scripts section.
Create an OLE DB connection pointing to the SQL Server database. Create a flat file connection as shown in screenshots #3 and #4. Flat File Connection Columns section contains column level information. Screenshot #5 shows the columns data preview.
Configure the Control Flow Task as shown in screenshot #6. Configure the tasks Pre Load, Post Load and Loop Files as shown in screenshots #7 - #10. Pre Load will truncate staging table and set IsProcessed flag to false for all rows in ItemInfo table. Post Load will update the changes and will delete rows in database that are not found in the file. Refer the stored procedures used in those tasks to understand what is being done in these Execute SQL tasks.
Double-click on the Load Items data flow task and configure it as shown in screenshot #11. Read File is a flat file source configured to use the flat file connection. Row Count is derived column transformation and its configuration is shown in screenshto #12. Check Exist is a lookup transformation and its configurations are shown in screenshots #13 - #15. Lookup No Match Output is redirected to Destination Split on the left side. Lookup Match Output is redirected to Staging Split on the left side. Destination Split and Staging Split have the exact same configuration as shown in screenshot #16. The reason for 9 different destinations for both destination and staging table is to improve the performance of the package.
All the destination tasks 0 - 8 are configured to insert data into table dbo.ItemInfo as shown in screenshot #17. All the staging tasks 0 - 8 are configured to insert data into dbo.Staging as shown in screenshot #18.
On the Flat File connection manager, set the ConnectionString property to use the variable FilePath as shown in screenshot #19. This will enable the package to use the value set in the variable as it loops through each file in a folder.

Test scenarios:

Test results may vary from machine to machine. 
In this scenario, file was located locally on the machine. 
Files on network might perform slower. 
This is provided just to give you an idea. 
So, please take these results with grain of salt.

Package was executed on a 64-bit machine with Xeon single core CPU 2.5GHz and 3.00 GB RAM.
Loaded a flat file with 1 million rows. Package executed in about 2 mins 47 seconds. Refer screenshots #20 and #21.
Used the queries provided under Test queries section to modify the data to simulate update, delete and creation of new records during the second run of the package.
Loaded the same file containing the 1 million rows after the following queries were executed in the database. Package executed in about 1 min 35 seconds. Refer screenshots #22 and #23. Please note the number of rows redirected to destination and staging table in screenshot #22.

The above answer is copy of the answer provided to this Stack Overflow question as well.

Hope that helps.

Test queries:
.

--These records will be deleted during next run 
--because item ids won't match with file data.
--(111111 row(s) affected)
UPDATE dbo.ItemInfo SET ItemId = 'DEL_' + ItemId WHERE Id % 9 IN (3)

--These records will be modified to their original item type of 'General'
--because that is the data present in the file.
--(222222 row(s) affected)
UPDATE dbo.ItemInfo SET ItemType = 'Testing' + ItemId WHERE Id % 9 IN (2,6)

--These records will be reloaded into the table from the file.
--(111111 row(s) affected)
DELETE FROM dbo.ItemInfo WHERE Id % 9 IN (5,9)

Flat File Connection Columns
.

Name        InputColumnWidth     DataType          OutputColumnWidth
----------  ----------------     ---------------   -----------------
Id          8                    string [DT_STR]   8
ItemId      11                   string [DT_STR]   11
ItemName    21                   string [DT_STR]   21
ItemType    9                    string [DT_STR]   9

Scripts: (to create both tables and stored procedures)
.

CREATE TABLE [dbo].[ItemInfo](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemId] [varchar](255) NOT NULL,
    [ItemName] [varchar](255) NOT NULL,
    [ItemType] [varchar](255) NOT NULL,
    [IsProcessed] [bit] NULL,
    CONSTRAINT [PK_ItemInfo] PRIMARY KEY CLUSTERED ([Id] ASC),
    CONSTRAINT [UK_ItemInfo_ItemId] UNIQUE NONCLUSTERED ([ItemId] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Staging](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemId] [varchar](255) NOT NULL,
    [ItemName] [varchar](255) NOT NULL,
    [ItemType] [varchar](255) NOT NULL,
 CONSTRAINT [PK_Staging] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO

CREATE PROCEDURE [dbo].[PostLoad]
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE      ITM
    SET         ITM.ItemName    = STG.ItemName
            ,   ITM.ItemType    = STG.ItemType 
            ,   ITM.IsProcessed = 1
    FROM        dbo.ItemInfo    ITM
    INNER JOIN  dbo.Staging     STG
    ON          ITM.ItemId      = STG.ItemId;

    DELETE FROM dbo.ItemInfo
    WHERE       IsProcessed = 0;
END
GO

CREATE PROCEDURE [dbo].[PreLoad]
AS
BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE dbo.Staging;     

    UPDATE  dbo.ItemInfo 
    SET     IsProcessed = 0;
END
GO

Screenshot #1:

Screenshot #2:

Screenshot #3: