如何拆分平面文件数据并加载到数据库中的父子表中?

发布于 2024-11-23 16:32:40 字数 2021 浏览 3 评论 0原文

我有需要导入到父子表中的非规范化数据(来自文件)。源数据是这样的:

Account#    Name        Membership    Email
101         J Burns     Gold          [email protected]
101         J Burns     Gold          [email protected]
101         J Burns     Gold          [email protected]
227         H Gordon    Silver        [email protected]
350         B Clyde     Silver        [email protected]
350         B Clyde     Silver        [email protected]

我应该使用哪些 SSIS 片段、部分或策略将前三列读入父表,将第四列(电子邮件)读入子表?对于允许我采用的父键,我有多个选项:

  • 直接使用 Account# 作为主键
  • 在导入过程中使用 SSIS 生成的代理键
  • 配置身份主键

我确信我已经列出了我的主键关键选项按难度递增的顺序排列。我有兴趣知道如何执行第一个和最后一个选项 - 我将推断如何实现中间选项。再次强调,我对 SSIS 解决方案很感兴趣;我正在寻找使用 SSIS 语言的答案,而不是过程性的、技术中立的答案。

我的问题有点类似于 另一个SO问题,有一个模糊的可行性答案。希望能给予更详细的指导。我已经知道如何通过创建“暂存”中间步骤来解决这个问题,其中父子分离实际上是通过直接 SQL 来处理的。然而,我很好奇如果没有这种中间步骤如何做到这一点。

在我看来,这种导入是如此普遍,以至于会有一种已发表的公式化方法来处理它——SSIS 擅长的一种技术。到目前为止,我还没有看到任何直接的答案。

更新#1:根据评论,我调整了示例数据,使其更加明显地非规范化。我还从“平面文件”中删除了“平面”,以便语义不会干扰问题。

更新#2:我对使用 SSIS 语言的解决方案更加感兴趣。

I have denormalized data (coming from a file) that needs to be imported into parent-child tables. The source data is something like this:

Account#    Name        Membership    Email
101         J Burns     Gold          [email protected]
101         J Burns     Gold          [email protected]
101         J Burns     Gold          [email protected]
227         H Gordon    Silver        [email protected]
350         B Clyde     Silver        [email protected]
350         B Clyde     Silver        [email protected]

What are the pieces, parts, or tactics of SSIS I should use to read the first three columns into a parent table, and the 4th column (Email) into a child table? I have several options for the parent key which I am permitted to take:

  • Directly use the Account# as the primary key
  • Use a surrogate key generated by SSIS during the import process
  • Configure an identity primary key

I'm sure I've listed my primary key options in increasing order of difficulty. I'd be interested in knowing how to do the first and the last option - I'll infer how to achieve the middle option. To emphasize again, I'm interested in a decidedly SSIS solution; I'm looking for an answer that uses the language of SSIS, rather than a procedural, technology neutral answer.

My question is somewhat similar to another SO question, having an answer of vague viability. I'm hoping more detailed guidance could be given. I already know how to solve this problem by creating a "staging" middle-step, where the parent-child separation is actually handled with straight SQL. However, I'm curious about how this can be done without that kind of middle-step.

It seems to me this kind of import would be so common, that there would be a well-published formulaic way to handle it - a technique that SSIS excels at. As yet, I've not quite seen any straight up answer to this.

Update #1: Based on comments, I've adjusted the sample data to be more obviously denormalized. I also removed "flat" from "flat file," so that semantics don't interfere with the question.

Update #2: I've amplified my interest in a solution spoken in the language of SSIS.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

茶色山野 2024-11-30 16:32:40

这是您在加载父子数据时可以考虑的一种可能的选择。该选项由两个步骤组成。在第一步中,读取源文件并将数据写入父表。在第二步中,再次读取源文件并使用查找转换来获取父表信息,以便将数据写入子表。以下示例使用问题中提供的数据。此示例是使用 SSIS 2008 R2 和 SQL Server 2008 数据库创建的。

分步过程:

  1. 创建一个名为 Source.txt 的示例平面文件,如屏幕截图 #1 所示。

  2. 在 SQL 数据库中,使用 SQL 脚本 部分中给出的脚本创建两个名为 dbo.Parentdbo.Child 的表。这两个表都有一个自动生成的标识列。

  3. 在包上,放置一个 OLE DB 连接 以连接到 SQL Server 和 平面文件连接 以读取源文件,如屏幕截图所示 #2 。配置平面文件连接,如屏幕截图 #3 - #9 所示。

  4. 在“控制流”选项卡上,放置两个数据流任务,如屏幕截图 #10 所示。

  5. 在名为 Parent 的数据流任务中,放置平面文件源、排序转换和 OLE DB 目标,如屏幕截图 #11 所示。

  6. 配置平面文件源,如屏幕截图 #12 和 #13 所示。我们需要读取平面文件源。

  7. 配置排序转换,如屏幕截图 #14 所示。我们需要消除重复值,以便仅将唯一记录插入到父表 dbo.Parent 中。

  8. 配置 ole db 目标,如屏幕截图 #15 和 #16 所示。我们需要将数据插入到父表 dbo.Parent 中。

  9. 在名为 Child 的数据流任务中,放置平面文件源、查找转换和 OLE DB 目标,如屏幕截图 #17 所示。

  10. 配置平面文件源,如屏幕截图 #12 和 #13 所示。此配置与之前数据流任务中的平面文件源相同。

  11. 配置查找转换,如屏幕截图 #18 和 #20 所示。我们需要使用文件中存在的其他关键列从表 dbo.Parent 中找到父 ID。这里的关键列是帐户、姓名和电子邮件。如果文件碰巧有一个唯一的列,您可以单独使用该列来获取父 ID。

  12. 配置 ole db 目标,如屏幕截图 #21 和 #22 所示。我们需要将电子邮件列以及父 ID 插入到表 dbo.Child 中。

  13. 屏幕截图 #23 显示包执行之前中的表中的数据。

  14. 屏幕截图 #24 和 #25 显示示例包执行情况。

  15. 屏幕截图 #26 显示包执行之后的表中的数据。

希望有帮助。

SQL 脚本:

CREATE TABLE [dbo].[Child](
    [ChildId] [int] IDENTITY(1,1) NOT NULL,
    [ParentId] [int] NULL,
    [Email] [varchar](21) NULL,
CONSTRAINT [PK_Child] PRIMARY KEY CLUSTERED ([ChildId] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Parent](
    [ParentId] [int] IDENTITY(1,1) NOT NULL,
    [Account] [varchar](12) NULL,
    [Name] [varchar](12) NULL,
    [Membership] [varchar](14) NULL,
CONSTRAINT [PK_Parent] PRIMARY KEY CLUSTERED ([ParentId] ASC)) ON [PRIMARY]
GO

屏幕截图 #1:

1

屏幕截图#2:

2

屏幕截图 #3:

3

屏幕截图 #4:

4

屏幕截图 #5:

5

屏幕截图 # 6:

6

屏幕截图 #7:

7

屏幕截图 #8:

8

屏幕截图 # 9:

9

屏幕截图 #10:

10

屏幕截图 #11:

11

屏幕截图#12:

12

屏幕截图 #13:

13

屏幕截图 #14:

“14”

屏幕截图 #15:

15

屏幕截图 #16:

16

屏幕截图 #17:

17

屏幕截图 #18:

18

屏幕截图 #19:

19

屏幕截图#20:

20

屏幕截图 #21:

21

屏幕截图 #22:

22

屏幕截图 #23:

23

屏幕截图 #24:< /strong>

24

屏幕截图 #25:

25

屏幕截图 #26:

26

Here is one possible option that you can consider in loading parent-child data. This option consists of two steps. In the first step, read the source file and write data to parent table. In the second step, read the source file again and use lookup transformation to fetch the parent info in order to write data to the child table. Following example uses the data provided in the question. This example was created using SSIS 2008 R2 and SQL Server 2008 database.

Step-by-Step process:

  1. Create a sample flat file named Source.txt as shown in screenshot #1.

  2. In the SQL database, create two tables named dbo.Parent and dbo.Child using the scripts given under SQL Scripts section. Both the tables have an auto generated identity column.

  3. On the package, place an OLE DB connection to connect to the SQL Server and Flat File connection to read the source file as shown in screenshot #2. Configure the flat file connection as shown in screenshots #3 - #9.

  4. On the Control Flow tab, place two Data Flow Tasks as shown in screenshot #10.

  5. Inside the data flow task named Parent, place a Flat File source, Sort transformation and an OLE DB destination as shown in screenshot #11.

  6. Configure the flat file source as shown in screenshots #12 and #13. We need to read the flat file source.

  7. Configure the sort transformation as shown in screenshot #14. We need to eliminate the duplicate values so that only the unique records are inserted into the parent table dbo.Parent.

  8. Configure the ole db destination as shown in screenshots #15 and #16. We need to insert the data into the parent table dbo.Parent.

  9. Inside the data flow task named Child, place a Flat File source, Lookup transformation and an OLE DB destination as shown in screenshot #17.

  10. Configure the flat file source as shown in screenshots #12 and #13. This configuration is same as the flat file source in the previous data flow task.

  11. Configure the lookup transformation as shown in screenshots #18 and #20. We need to find the parent id from the table dbo.Parent using the other key columns present in the file. The key columns here are the Account, Name and Email. If the file happened to have a unique column, you could just use that column alone to fetch the parent id.

  12. Configure the ole db destination as shown in screenshots #21 and #22. We need to insert the Email column along with the Parent id into the table dbo.Child.

  13. Screenshot #23 shows data in the tables before the package execution.

  14. Screenshots #24 and #25 show sample package execution.

  15. Screenshot #26 shows data in the tables after the package execution.

Hope that helps.

SQL Scripts:

CREATE TABLE [dbo].[Child](
    [ChildId] [int] IDENTITY(1,1) NOT NULL,
    [ParentId] [int] NULL,
    [Email] [varchar](21) NULL,
CONSTRAINT [PK_Child] PRIMARY KEY CLUSTERED ([ChildId] ASC)) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Parent](
    [ParentId] [int] IDENTITY(1,1) NOT NULL,
    [Account] [varchar](12) NULL,
    [Name] [varchar](12) NULL,
    [Membership] [varchar](14) NULL,
CONSTRAINT [PK_Parent] PRIMARY KEY CLUSTERED ([ParentId] ASC)) ON [PRIMARY]
GO

Screenshot #1:

1

Screenshot #2:

2

Screenshot #3:

3

Screenshot #4:

4

Screenshot #5:

5

Screenshot #6:

6

Screenshot #7:

7

Screenshot #8:

8

Screenshot #9:

9

Screenshot #10:

10

Screenshot #11:

11

Screenshot #12:

12

Screenshot #13:

13

Screenshot #14:

14

Screenshot #15:

15

Screenshot #16:

16

Screenshot #17:

17

Screenshot #18:

18

Screenshot #19:

19

Screenshot #20:

20

Screenshot #21:

21

Screenshot #22:

22

Screenshot #23:

23

Screenshot #24:

24

Screenshot #25:

25

Screenshot #26:

26

酒中人 2024-11-30 16:32:40

如果数据已排序并且 Account# 是整数,我会:

将电子邮件插入表中(添加自动增量列,这是最佳实践)。

1  101    [email protected]
2  101    [email protected]
3  101    [email protected]
etc.

然后我会将其他记录插入到父表中。

  • 使用 Account# 作为主键
  • ,省略电子邮件地址,
  • 跳过重复项(如果数据
    已排序)。

如果您有外键关系设置,则需要先执行第二步(以避免出现任何孤立记录)。

我的两分钱:我不知道你的要求是什么,但它似乎有点过度标准化。如果电子邮件地址的数量有一个小的限制,我会考虑在主表中添加几个电子邮件列......以提高速度和简单性。

If the data is sorted and Account# is an integer I would:

Insert the emails into a table (add an auto increment column, it's a best practise).

1  101    [email protected]
2  101    [email protected]
3  101    [email protected]
etc.

Then I would insert the other records to a parent table.

  • using Account# as the primary key
  • omitting the email addresses
  • skipping duplicates (easy if the data
    is sorted).

If you have a foreign key relationship setup, you will need to do the second step first (to avoid having any orphan records).

My two cents: I don't know what your requirements are but it seems a bit over-normalized. If there is a small limit on the number of email addresses, I would consider adding several email columns to the main table...for speed and simplicity.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文