使用SSIS或SQL将多个CSV文件加入单个表

发布于 2025-02-01 09:29:50 字数 500 浏览 1 评论 0原文

SSIS初学者,在过去十年中,在GitHub中遵循编码语言性能的项目。我有3个数据集:

  1. REPOS
  2. 拉请求
  3. 问题。

问题数据集如下:

“问题dataSet”

的示例,这是repos dataset

“

我的第一期,我不需要这个季度。 其次,以及如何将这些平面文件合并到一个表格和总和数据的原因是什么,

Language | Nbr of repos | Nbr of Pull requests | Year

这将是非常感谢的,我尝试了许多方法来弄清楚它,但无法实现我的目标m寻找。

SSIS beginner here, working on project that follows coding languages performance in github throughout the last decade. I have 3 datasets :

  1. Repos
  2. Pull Requests
  3. Issues.

Issues dataset is as below:

Example of issues dataset

and here is an example of Repos dataset

Repos dataset.

First issue of mine, I don't need the quarter.
Second, and what brought me here, how to merge these flat files to a single table and sum data to be

Language | Nbr of repos | Nbr of Pull requests | Year

Help would be much appreciated, I've tried many ways to figure it out but haven't been able to achieve what I'm looking for.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

独孤求败 2025-02-08 09:29:50

1-如果您不需要数据,请不要使用它。最好是,不要将其带入DataFlow管道中,而仅仅因为它在那里,这并不意味着您需要使用它。

我会主张将每个文件/选项卡/数据集降落到数据库中的匹配表中。它使您能够确保您降落与所提供的数据完全相同。完成此操作后,编写一个自定义查询以将各种表连接在一起。类似于

SELECT
    R.language
,   R.num_repos AS [Nbr of repos]
,   SUM(PR.[count]) AS [Nbr of Pull requests]
,   PR.year
FROM
    dbo.repos AS R
    -- Assuming inner is accurate
    INNER JOIN
        dbo.PullRequests AS PR
        ON PR.name = R.language
GROUP BY
    R.language
,   R.num_repos
,   PR.year;

总和和组之类的事实反映了您可能希望每年获得总数,并且由于数据似乎是每季度报告的,因此应该将其结合起来。或者,这可能是您需要获得给定名称和计数的最后一个季度所需的半添加数据(语言Monkeybrainz在Q3 2021中将其重命名为Laconicape,因此MB的最后一个条目是在Q3 2021中,la开始Q4 Q4 ,2021)

然后,您可以将该自定义查询用作SSIS中的源来生成您的提取物(到CSV,Excel,to Excel,to apooche flerken-在2033年将是巨大的)

SSIS路由

将数据列入SSIS中有一些选项:SSIS:合并加入,查找,脚本任务。

合并加入需要根据密钥对数据进行排序。在这种情况下,名称/语言需要从源组件(Flat File/Excel?)进行排序转换。那被送入合并加入。

如果您在拉的请求中都有所有4个季度,并且想要一年的总统计数据,则您可能需要在源和排序之间进行汇总转换,以按名称和年度计数,或者如果是半添加的,然后找到名字和年份的最后一个季度。

非排序方法将涉及两个数据流和查找组件。查找最多可以进行一场比赛。在提供的屏幕截图中,我假设拉动请求是“驱动程序”表,因为从“成立年”到当前,C ++将在其中。而您的存储库数据集只有一次只有C ++。因此,我们将从REPOS数据集中的一些字段中扩大“当前”行。

这里唯一的打ic是,查找支持来自OLE DB提供商或SSIS Cache Connection Manager的数据。同样,假设您的数据是平面文件/excel,我们需要使用CACHENCENT CONNECTION MANAGAR。

数据流1将使用我们的回购数据“ prime the Pump”。您的数据流将是来源 - >缓存连接转换

将CACHEnction Manager中的密钥定义为语言

缓存连接管理器

在数据流2中,您将使用带有Cache Connection Manager源的查找转换,然后将其映射到语言中,然后单击检查在语言和num_repos旁边的框,将它们带入您的管道中。

最后,将结果写给您需要的任何地方。

1 - If you don't need data, don't use it. Preferably, don't bring it into the dataflow pipeline but just because it's there, it doesn't mean you need to use it.

I would advocate landing each file/tab/dataset into a matching table in your database. It allows you to ensure you've landed exactly the same data that was provided. Once that's done, write a custom query to join the assorted tables together. Something like

SELECT
    R.language
,   R.num_repos AS [Nbr of repos]
,   SUM(PR.[count]) AS [Nbr of Pull requests]
,   PR.year
FROM
    dbo.repos AS R
    -- Assuming inner is accurate
    INNER JOIN
        dbo.PullRequests AS PR
        ON PR.name = R.language
GROUP BY
    R.language
,   R.num_repos
,   PR.year;

The SUM and GROUP BY reflects the fact that you likely want to get the total counts per year and since the data appears to be reported at a quarterly interval, that should wrap that up. Or, it could be that this is semi-additive data you need to get the last available quarter for a given name and count (Language monkeyBrainz got renamed to LaconicApe in Q3 2021 so the last entry for MB is in Q3 2021 and LA starts Q4, 2021)

You can then use that custom query as a source in SSIS to generate your extract (to CSV, to Excel, to Apooche Flerken - it'll be huge in 2033)

SSIS route

Bringing data together has a few options in SSIS: Merge Join, Lookup, Script Task.

Merge Join requires data to be sorted based on the key(s). In this case, Name/Language needs to have a Sort transformation from the source component (flat file/excel?). That gets fed into the Merge Join.

If you have all 4 quarters in the pull requests and you want to total stats for the year, you'll likely want to have the Aggregate transformation between the source and the Sort to SUM the count by name and year or if it's semi-additive, then find the last quarter for name and year.

A non-sorting method would involve two data flows and the lookup component. A lookup can make, at most, one match. In the supplied screenshots, I would assume that the Pull Requests is the "driver" table as C++ will be in there from "year of inception" to current. Whereas your repos dataset would only have C++ in there once. So, we will augment the "current" row from pull requests with a few fields from the repos data set.

The only hiccup here is that the Lookup supports data from an OLE DB provider or the SSIS Cache Connection Manager. Again, assuming your data is Flat File/Excel, we'll need to use a Cache Connection Manager.

Data Flow 1 will "prime the pump" with our Repo data. Your Data Flow will be Source -> Cache Connection Transformation

Define the Key in your cache connection manager as the Language

Cache Connection Manager

In Data Flow 2, you'll use a Lookup Transformation with a source of Cache Connection Manager and then map Name to Language and click the check box next to Language and num_repos to bring them into your pipeline.

Finally, write the results to wherever you need them.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文