How do I create a primary key on quasi-unique data keys?

Published 2024-11-27 23:08:43


I have a nightly SSIS process that exports a TON of data from an AS400 database system. Due to bugs in the AS400 DB software, occasional duplicate keys are inserted into data tables. Every time a new duplicate is added to an AS400 table, it kills my nightly export process. This issue has moved from being a nuisance to a problem.

What I need is to have an option to insert only unique data. If there are duplicates, select the first encountered row of the duplicate rows. Is there SQL Syntax available that could help me do this? I know of the DISTINCT ROW clause but that doesn't work in my case because for most of the offending records, the entirety of the data is non-unique except for the fields which comprise the PK.

In my case, it is more important for my primary keys to remain unique in my SQL Server DB cache than to have a full snapshot of the data. Is there something I can do to force this constraint on the export in SSIS/SQL Server without crashing the process?

EDIT

Let me further clarify my request. What I need is to assure that the data in my exported SQL Server tables maintains the same keys that are maintained in the AS400 data tables. In other words, creating a unique row-count identifier wouldn't work, nor would inserting all of the data without a primary key.

If a bug in the AS400 software allows for mistaken, duplicate PKs, I want to either ignore those rows or, preferably, just select one of the rows with the duplicate key but not both of them.

This filtering should probably happen in the SELECT statement of my SSIS project, which connects to the mainframe through an ODBC connection.

I suspect that there may not be a "simple" solution to my problem. I'm hoping, however, that I'm wrong.


Comments (4)

無處可尋 2024-12-04 23:08:43


Since you are using SSIS, you must be using OLE DB Source to fetch the data from AS400 and you will be using OLE DB Destination to insert data into SQL Server.

Let's assume that you don't have any transformations

[Screenshot: Data Flow Task]

Add a Sort transformation after the OLE DB Source. In the Sort transformation, there is a check box option at the bottom to remove duplicate rows based on a given set of column values. Check all the fields but don't select the primary key that comes from the AS400. This will eliminate the duplicate rows while still inserting the data that you need.

[Screenshot: Sort transformation]

I hope that is what you are looking for.

呆橘 2024-12-04 23:08:43


In SQL Server 2005 and above:

SELECT  *
FROM    (
        SELECT  *,
                ROW_NUMBER() OVER (PARTITION BY almost_unique_field ORDER BY id) rn
        FROM    import_table
        ) q
WHERE   rn = 1
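
This pattern is easy to try outside SQL Server: SQLite (3.25+) supports the same `ROW_NUMBER()` window function, so here is a small runnable sketch of the "keep the first row per key" de-dup. The table and column names (`import_table`, `pk`, `id`, `payload`) are placeholders, not taken from the original AS400 schema.

```python
import sqlite3

# In-memory demo of the ROW_NUMBER() de-dup pattern: one row per key,
# the first encountered (lowest id) wins. Requires SQLite >= 3.25.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE import_table (id INTEGER, pk INTEGER, payload TEXT)")
con.executemany(
    "INSERT INTO import_table VALUES (?, ?, ?)",
    [(1, 100, "a"), (2, 100, "b"),  # pk 100 is duplicated
     (3, 200, "c")],
)
rows = con.execute("""
    SELECT id, pk, payload
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY pk ORDER BY id) AS rn
        FROM import_table
    ) q
    WHERE rn = 1
    ORDER BY pk
""").fetchall()
print(rows)  # only the first row for each duplicated pk survives
```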
聆听风音 2024-12-04 23:08:43


There are several options.

If you use the IGNORE_DUP_KEY option (http://www.sqlservernation.com/home/creating-indexes-with-ignore_dup_key.html) on your primary key, SQL Server will issue a warning and only the duplicate records will fail.
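
`IGNORE_DUP_KEY` is SQL Server-specific, but SQLite's `INSERT OR IGNORE` behaves analogously for experimenting with the idea: rows that would violate the unique constraint are skipped instead of failing the whole load. This is a rough sketch with placeholder names, not the SQL Server implementation itself.

```python
import sqlite3

# Rough SQLite analogue of SQL Server's IGNORE_DUP_KEY: a duplicate
# primary key is silently skipped rather than aborting the insert.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (pk INTEGER PRIMARY KEY, payload TEXT)")
con.executemany(
    "INSERT OR IGNORE INTO target VALUES (?, ?)",
    [(100, "a"), (100, "b"), (200, "c")],  # pk 100 arrives twice
)
kept = con.execute("SELECT pk, payload FROM target ORDER BY pk").fetchall()
print(kept)  # first row per pk survives; the duplicate is dropped
```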

You can also group/roll-up your data but this can get very expensive. What I mean by that is:

SELECT Id, MAX(value1), MAX(value2), MAX(value3)  -- one aggregate per remaining column
FROM   staging_table
GROUP BY Id

Another option is to add an identity column (and cluster on this for an efficient join later) to your staging table and then create a mapping in a temp table. The mapping table would be:

CREATE TABLE #mapping 
( 
    RowID INT PRIMARY KEY CLUSTERED,
    PKID INT
)

INSERT INTO #mapping (RowID, PKID)
SELECT MIN(RowID), PKID
FROM staging_table  
GROUP BY PKID

INSERT INTO presentation_table
SELECT S.* 
FROM Staging_table S 
    INNER JOIN #mapping M 
        ON S.RowID = M.RowID 
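
Here is a runnable sketch of that staging-plus-mapping approach using SQLite, where an autoincrementing `RowID` stands in for the identity column; all table and column names are placeholders for illustration.

```python
import sqlite3

# Staging + mapping de-dup: give each staged row an identity-like RowID,
# map each PKID to its first RowID, then join to copy one row per key.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE staging (RowID INTEGER PRIMARY KEY, PKID INTEGER, payload TEXT)")
con.execute(
    "CREATE TABLE presentation (RowID INTEGER, PKID INTEGER, payload TEXT)")
con.executemany(
    "INSERT INTO staging (PKID, payload) VALUES (?, ?)",
    [(100, "a"), (100, "b"), (200, "c")],  # PKID 100 is duplicated
)
con.execute("""
    CREATE TEMP TABLE mapping AS
    SELECT MIN(RowID) AS RowID, PKID
    FROM staging
    GROUP BY PKID
""")
con.execute("""
    INSERT INTO presentation
    SELECT s.* FROM staging s JOIN mapping m ON s.RowID = m.RowID
""")
out = con.execute("SELECT PKID, payload FROM presentation ORDER BY PKID").fetchall()
print(out)  # exactly one row per PKID, the first staged occurrence
```
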
若沐 2024-12-04 23:08:43


If I understand you correctly, you have duplicated PKs that have different data in the other fields.

First, put the data from the other database into a staging table. I find it easier to research issues with imports (especially large ones) if I do this. Actually I use two staging tables (and for this case I strongly recommend it), one with the raw data and one with only the data I intend to import into my system.

Now you can use an Execute SQL task to grab one of the records for each key (see @Quassnoi's answer for an idea of how to do that; you may need to adjust his query for your situation). Personally, I put an identity column into my staging table so I can identify the first or last occurrence of duplicated data. Then put the record you chose for each key into your second staging table. If you are using an exception table, copy the records you are not moving into it, and don't forget a reason code for the exception ("Duplicate key", for instance).

Now that you have only one record per key in a staging table, your next task is to decide what to do about the other data that is not unique. If there are two different business addresses for the same customer, which do you choose? This is a matter of business-rule definition, not strictly speaking SSIS or SQL code. You must define the business rules for how to choose the data when it needs to be merged between two records (what you are doing is the equivalent of a de-duping process). If you are lucky, there is a date field or another way to determine which is the newest or oldest data, and that is the data they want you to use. In that case, once you have selected just one record, you are done with the initial transform.

More likely, though, you will need different rules for each field in order to choose the correct one. In that case you write SSIS transformations in a data flow, or Execute SQL tasks, to pick the correct data and update the staging table.

Once you have the exact records you want to import, run the data flow to move them into the correct production tables.
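
Where a "keep the newest" business rule applies, the `ROW_NUMBER()` pattern only needs its `ORDER BY` pointed at the date column. A SQLite sketch, assuming a usable date column exists; every name here (`staging`, `PKID`, `updated_at`, `addr`) is a placeholder:

```python
import sqlite3

# "Keep the newest row per key" rule: partition by the key and order
# descending by a date column, then take rn = 1.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE staging (PKID INTEGER, updated_at TEXT, addr TEXT)")
con.executemany(
    "INSERT INTO staging VALUES (?, ?, ?)",
    [(100, "2024-01-01", "old address"),
     (100, "2024-06-01", "new address"),   # newer row for key 100
     (200, "2024-03-01", "only address")],
)
rows = con.execute("""
    SELECT PKID, addr
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY PKID ORDER BY updated_at DESC) AS rn
        FROM staging
    )
    WHERE rn = 1
    ORDER BY PKID
""").fetchall()
print(rows)  # the most recent address wins for each key
```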
