How do I load a large flat file into a database table using SSIS?

I'm not sure how it works, so I'm looking for the right solution. I think SSIS is the right way to go, but I have never used it before.

Scenario:

Every morning, I get a tab-delimited file with 800K records. I need to load it into my database:

  1. Get the file from FTP or a local folder.
  2. First, I need to delete the records from the database that do not exist in the new file;
    • How can I compare the data in T-SQL?
    • Where should I load the data from the tab-delimited file so I can compare it against the table? Should I use a temp table? ItemID is the unique column in the table.
  3. Second, I need to insert only the new records into the database.
  4. Of course, it should be automated.
  5. It should be an efficient approach that doesn't overload the SQL database.

Don't forget that the file contains 800K records.

Sample flat file data:

ID  ItemID  ItemName  ItemType
--  ------  --------  --------
 1  2345    Apple     Fruit
 2  4578    Banana    Fruit

How can I approach this problem?

不可一世的女人 2024-11-15 15:10:28

Yes, SSIS can perform the requirements that you have specified in the question. The following example should give you an idea of how it can be done. The example uses SQL Server as the back-end. Some of the basic test scenarios performed on the package are provided below. Sorry for the lengthy answer.

Step-by-step process:

  1. In the SQL Server database, create two tables, namely dbo.ItemInfo and dbo.Staging. The create table queries are available under the Scripts section. The structure of these tables is shown in screenshot #1. ItemInfo will hold the actual data, and the Staging table will hold the staging data used to compare and update the actual records. The Id column in both tables is an auto-generated unique identity column. The IsProcessed column in the table ItemInfo will be used to identify and delete records that are no longer valid.

  2. Create an SSIS package and create 5 variables as shown in screenshot #2. I have used the .txt extension for the tab-delimited files, hence the value *.txt in the variable FileExtension. The FilePath variable will be assigned a value at run-time. The FolderLocation variable denotes where the files are located. The SQLPostLoad and SQLPreLoad variables denote the stored procedures used during the pre-load and post-load operations. Scripts for these stored procedures are provided under the Scripts section.

  3. Create an OLE DB connection pointing to the SQL Server database. Create a flat file connection as shown in screenshots #3 and #4. The Flat File Connection Columns section contains the column-level information. Screenshot #5 shows a preview of the column data.

  4. Configure the control flow as shown in screenshot #6. Configure the tasks Pre Load, Post Load and Loop Files as shown in screenshots #7 - #10. Pre Load will truncate the staging table and set the IsProcessed flag to false for all rows in the ItemInfo table. Post Load will apply the updates and delete the rows in the database that are not found in the file. Refer to the stored procedures used in those tasks to understand what is being done in these Execute SQL tasks.

  5. Double-click on the Load Items data flow task and configure it as shown in screenshot #11. Read File is a flat file source configured to use the flat file connection. Row Count is a derived column transformation and its configuration is shown in screenshot #12. Check Exist is a lookup transformation and its configurations are shown in screenshots #13 - #15. The Lookup No Match Output is redirected to Destination Split, and the Lookup Match Output is redirected to Staging Split. Destination Split and Staging Split have the exact same configuration, shown in screenshot #16. The reason for 9 different destinations for both the destination and staging tables is to improve the performance of the package.

  6. All the destination tasks 0 - 8 are configured to insert data into the table dbo.ItemInfo as shown in screenshot #17. All the staging tasks 0 - 8 are configured to insert data into dbo.Staging as shown in screenshot #18.

  7. On the Flat File connection manager, set the ConnectionString property to use the variable FilePath, as shown in screenshot #19. This enables the package to use the value set in the variable as it loops through each file in the folder.

Test scenarios:

Test results may vary from machine to machine. In this scenario, the file was located locally on the machine; files on a network might perform slower. This is provided just to give you an idea, so please take these results with a grain of salt.

  1. The package was executed on a 64-bit machine with a single-core Xeon CPU at 2.5 GHz and 3.00 GB RAM.

  2. Loaded a flat file with 1 million rows. The package executed in about 2 minutes 47 seconds. Refer to screenshots #20 and #21.

  3. Used the queries provided under the Test queries section to modify the data to simulate updates, deletes, and creation of new records during the second run of the package.

  4. Loaded the same file containing the 1 million rows after the following queries were executed against the database. The package executed in about 1 minute 35 seconds. Refer to screenshots #22 and #23. Note the number of rows redirected to the destination and staging tables in screenshot #22.

Hope that helps.

Test queries:

--These records will be deleted during next run 
--because item ids won't match with file data.
--(111111 row(s) affected)
UPDATE dbo.ItemInfo SET ItemId = 'DEL_' + ItemId WHERE Id % 9 IN (3)

--These records will be modified to their original item type of 'General'
--because that is the data present in the file.
--(222222 row(s) affected)
UPDATE dbo.ItemInfo SET ItemType = 'Testing' + ItemId WHERE Id % 9 IN (2,6)

--These records will be reloaded into the table from the file.
--(111111 row(s) affected)
DELETE FROM dbo.ItemInfo WHERE Id % 9 IN (5,9)

Flat File Connection Columns

Name        InputColumnWidth     DataType          OutputColumnWidth
----------  ----------------     ---------------   -----------------
Id          8                    string [DT_STR]   8
ItemId      11                   string [DT_STR]   11
ItemName    21                   string [DT_STR]   21
ItemType    9                    string [DT_STR]   9

Scripts: (to create both tables and stored procedures)

CREATE TABLE [dbo].[ItemInfo](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemId] [varchar](255) NOT NULL,
    [ItemName] [varchar](255) NOT NULL,
    [ItemType] [varchar](255) NOT NULL,
    [IsProcessed] [bit] NULL,
    CONSTRAINT [PK_ItemInfo] PRIMARY KEY CLUSTERED ([Id] ASC),
    CONSTRAINT [UK_ItemInfo_ItemId] UNIQUE NONCLUSTERED ([ItemId] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Staging](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [ItemId] [varchar](255) NOT NULL,
    [ItemName] [varchar](255) NOT NULL,
    [ItemType] [varchar](255) NOT NULL,
 CONSTRAINT [PK_Staging] PRIMARY KEY CLUSTERED ([Id] ASC)) ON [PRIMARY]
GO

CREATE PROCEDURE [dbo].[PostLoad]
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE      ITM
    SET         ITM.ItemName    = STG.ItemName
            ,   ITM.ItemType    = STG.ItemType 
            ,   ITM.IsProcessed = 1
    FROM        dbo.ItemInfo    ITM
    INNER JOIN  dbo.Staging     STG
    ON          ITM.ItemId      = STG.ItemId;

    DELETE FROM dbo.ItemInfo
    WHERE       IsProcessed = 0;
END
GO

CREATE PROCEDURE [dbo].[PreLoad]
AS
BEGIN
    SET NOCOUNT ON;

    TRUNCATE TABLE dbo.Staging;     

    UPDATE  dbo.ItemInfo 
    SET     IsProcessed = 0;
END
GO

Screenshots #1 - #23:

(The screenshots referenced above are not reproduced here. They showed, in order: the two table structures (#1), the package variables (#2), the flat file connection settings (#3, #4) and column data preview (#5), the control flow (#6), the Pre Load, Post Load and Loop Files task configurations (#7 - #10), the Load Items data flow (#11), the derived column (#12), lookup (#13 - #15) and split (#16) configurations, the destination (#17) and staging (#18) task settings, the ConnectionString expression (#19), and the execution results of the two test runs (#20 - #23).)

给妤﹃绝世温柔 2024-11-15 15:10:28

Assuming you are using SQL Agent (or a similar scheduler):

Reqs 1/4) I would have a precursor step handle the FTP and/or file copy steps. I don't like to clutter my packages with file manipulation if I can avoid it.
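
If that precursor step lives in the same Agent job, it can be a plain CmdExec step ahead of the SSIS step. A minimal sketch, assuming a job named DailyItemLoad already exists; the job name, share, and folder paths are hypothetical placeholders:

-- Sketch only: job name, share and folder paths are assumptions, not prescriptions.
EXEC msdb.dbo.sp_add_jobstep
    @job_name          = N'DailyItemLoad'   -- assumed to already exist
,   @step_name         = N'Copy flat file'
,   @step_id           = 1                  -- run before the SSIS step
,   @subsystem         = N'CmdExec'
,   @command           = N'cmd /c move /Y "\\ftpserver\drop\*.txt" "C:\Loads\Items\"'
,   @on_success_action = 3;                 -- 3 = go to the next step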

Reqs 2/3) At the control flow level, the package design is going to look like an Execute SQL Task connected to a Data Flow Task connected to another Execute SQL Task. As @AllenG indicated, you'd be best served by loading into a staging table via the Data Flow Task. The first Execute SQL Task will purge any rows from the staging table (TRUNCATE TABLE dbo.DAILY_STAGE).

Approximate table design looks like this. The MICHAEL_BORN table is your existing table and the DAILY_STAGE is where your data flow will land.

CREATE TABLE DBO.MICHAEL_BORN
(
    ID int identity(1,1) NOT NULL PRIMARY KEY CLUSTERED
,   ItemID int NOT NULL
,   ItemName varchar(20) NOT NULL
,   ItemType varchar(20) NOT NULL
)
CREATE TABLE dbo.DAILY_STAGE
(
    ItemID int NOT NULL PRIMARY KEY CLUSTERED
,   ItemName varchar(20) NOT NULL
,   ItemType varchar(20) NOT NULL
)

For demonstration purposes, I will load the above tables with sample data via T-SQL.

-- Original data
INSERT INTO
    dbo.MICHAEL_BORN
VALUES
    (2345,'Apple','Fruit')
,   (4578, 'Bannana','Fruit')


-- Daily load runs
-- Adds a new fruit (pear), corrects misspelling of banana, eliminates apple
INSERT INTO
    dbo.DAILY_STAGE
VALUES
    (7721,'Pear','Fruit')
,   (4578, 'Banana','Fruit')

The Execute SQL task will take advantage of the MERGE statement available in 2008+ editions of SQL Server. Please note the trailing semi-colon is part of the MERGE statement. Failure to include it will result in an error of "A MERGE statement must be terminated by a semi-colon (;)."

-- MERGE statement
-- http://technet.microsoft.com/en-us/library/bb510625.aspx
-- Given the above scenario, this script will
-- 1)  Update the matched (4578 bannana/banana) row
-- 2)  Add the new (pear) row
-- 3)  Remove the unmatched (apple) row

MERGE
    dbo.[MICHAEL_BORN] AS T
USING
(
    SELECT
        ItemID
    ,   ItemName
    ,   ItemType
    FROM
        dbo.DAILY_STAGE

) AS S
ON T.ItemID = S.ItemID
WHEN
    MATCHED THEN
    UPDATE
    SET
        T.ItemName = S.ItemName
    ,   T.ItemType = S.ItemType
WHEN
    NOT MATCHED THEN
    INSERT
    (
        ItemID
    ,   ItemName
    ,   ItemType
    )
    VALUES
    (
        ItemID
    ,   ItemName
    ,   ItemType
    )
WHEN
    NOT MATCHED BY SOURCE THEN
    DELETE
    ;

Req 5) Efficiency is totally based on your data and how wide your rows are, but it shouldn't be terrible.

-- Performance testing
-- Assumes you have a similar fast row number generator function
-- http://billfellows.blogspot.com/2009/11/fast-number-generator.html

TRUNCATE TABLE dbo.MICHAEL_BORN
TRUNCATE TABLE dbo.DAILY_STAGE

-- load initial rows
-- 20ish seconds
INSERT INTO
    dbo.MICHAEL_BORN
SELECT
    N.number AS ItemID
,   'Spam & eggs ' + CAST(N.number AS varchar(10)) AS ItemName
,   'SPAM' AS ItemType
--, CASE N.number % 2 WHEN 0 THEN N.number + 1000000 ELSE N.number END AS UpTheEvens
FROM
    dbo.GenerateNumbers(1000000) N


-- Load staging table
-- Odds get item type switched out
-- Evens get deleted and new ones created
-- 20ish seconds
INSERT INTO
    dbo.DAILY_STAGE
SELECT
    CASE N.number % 2 WHEN 0 THEN N.number + 1000000 ELSE N.number END AS ItemID
,   'Spam & eggs ' + CAST(N.number AS varchar(10)) AS ItemName
,   CASE N.number % 2 WHEN 0 THEN 'SPAM' ELSE 'Not much spam' END AS ItemType
FROM
    dbo.GenerateNumbers(1000000) N


-- Run MERGE statement, 32 seconds 1.5M rows upserted
-- Probably fast enough for you
不美如何 2024-11-15 15:10:28

I just want to share my thoughts for the next person who comes across this question, so I'll offer a suggestion for each scenario.
1. Get the file from FTP or a local folder.
I would suggest using Dropbox, Google Drive, or any other file-syncing cloud service of your choice; see this link for details.
2. I would suggest loading all the flat file data into a staging table, as you proposed. Comparing the data is then easily done with a MERGE between your staging table and the target table on your unique column (ItemID). You can see this link for how to write a MERGE script. The 2nd and 3rd scenarios are both solved if you use a MERGE script.
For the last two scenarios, I suggest you use a SQL Agent job to run the package automatically, scheduled during off hours or at a time when the server is not busy. Take a look at the link for details on how to run a package using a SQL Server Agent job, or just type it into your favorite search engine and you will find plenty of blogs showing how it's done. A minimal sketch is shown below.
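
As a rough sketch of that last point (not the only way to do it), the msdb stored procedures below create and schedule such a job, assuming the package runs from the file system; the job name, package path, and schedule time are hypothetical placeholders:

USE msdb;
GO
-- Sketch only: names, path and time below are placeholders, not prescriptions.
EXEC dbo.sp_add_job       @job_name = N'Daily Item Load';
EXEC dbo.sp_add_jobstep   @job_name  = N'Daily Item Load'
,                         @step_name = N'Run SSIS package'
,                         @subsystem = N'SSIS'
,                         @command   = N'/FILE "C:\Packages\LoadItems.dtsx"';
EXEC dbo.sp_add_schedule  @schedule_name     = N'Daily 6 AM'
,                         @freq_type         = 4        -- daily
,                         @freq_interval     = 1        -- every 1 day
,                         @active_start_time = 060000;  -- 06:00:00
EXEC dbo.sp_attach_schedule @job_name = N'Daily Item Load', @schedule_name = N'Daily 6 AM';
EXEC dbo.sp_add_jobserver   @job_name = N'Daily Item Load'; -- target the local server
GO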

看春风乍起 2024-11-15 15:10:28

SSIS sounds like the way to go. The way I've seen your type of issue handled previously is with a Staging Table. The new document loads into the Staging Table; then Staging and Production are compared: obsolete records are archived (not JUST deleted) from Production, existing rows with changes are updated (again, with the original data archived somewhere), and new rows are inserted. A sketch of the archive step is shown below.
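
A minimal sketch of the archive-not-delete idea, borrowing the dbo.ItemInfo / dbo.Staging names from the other answers and assuming a hypothetical dbo.ItemInfo_Archive table with the same columns plus an ArchivedOn column:

-- Sketch only: table and column names are assumptions, not a prescription.
-- The OUTPUT clause captures each deleted row into the archive table.
DELETE ITM
OUTPUT  DELETED.ItemId, DELETED.ItemName, DELETED.ItemType, SYSDATETIME()
INTO    dbo.ItemInfo_Archive (ItemId, ItemName, ItemType, ArchivedOn)
FROM    dbo.ItemInfo AS ITM
WHERE   NOT EXISTS (SELECT 1 FROM dbo.Staging AS STG WHERE STG.ItemId = ITM.ItemId);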

Note: Your definition of "obsolete" needs to be very, very precise. For instance: should something be archived away just because a matching row does not exist in your most recent file? Should it stay for X amount of time in case it comes in a subsequent file? These and other questions should be considered.

Almost any standard SSIS tutorial should point you down the correct path for how to do each of these steps.

风筝在阴天搁浅。 2024-11-15 15:10:28

I would give MERGE a shot. Make sure you eventually have indexes on ItemID on both tables.

MERGE [dbo].[ItemInfo] AS target
USING
(
    -- Staging holds the full daily file; ItemID is unique in both tables.
    SELECT stg.ItemID, stg.ItemName, stg.ItemType
    FROM [dbo].[ItemInfo_Staging] stg
) AS SOURCE
ON SOURCE.ItemID = target.ItemID

WHEN MATCHED THEN
    -- Row exists in both: refresh the attributes.
    UPDATE SET
        target.ItemName = SOURCE.ItemName
    ,   target.ItemType = SOURCE.ItemType

WHEN NOT MATCHED BY TARGET THEN
    -- Row only in the file: insert it.
    INSERT (ItemID, ItemName, ItemType)
        VALUES (SOURCE.ItemID, SOURCE.ItemName, SOURCE.ItemType)

WHEN NOT MATCHED BY SOURCE THEN
    -- Row no longer in the file: remove it.
    DELETE
;