Importing unique data - a C#/SQL Server solution

Posted 2024-12-23 18:43:03


We periodically need to import a CSV that looks like this:

Name,SpecID,TestResult1,TestResult2,TestResult3
Alex,ASD123,3.23,452.2,232
Craig,DFG444,453.56,345.3,23

The data gets stored this way:

SPECIMENTABLE (name,specid,SPECIMENTABLEID)
Alex,ASD123,1
Craig,DFG444,2

and

    RESULTTABLE (testresult,result,SPECIMENTABLEID)
    TestResult1,3.23,1
    TestResult2,452.2,1
    TestResult3,232,1
    TestResult1, 453.56,2
    etc
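For reference, the two destination tables described above might be declared something like this - the column types, the identity column, and the foreign key are my assumptions, since the question doesn't show the actual DDL:

    -- Hypothetical DDL matching the layout described above.
    CREATE TABLE SpecimenTable (
        SpecimenTableId INT IDENTITY(1,1) PRIMARY KEY,
        Name   VARCHAR(100) NOT NULL,
        SpecId VARCHAR(50)  NOT NULL
    );

    CREATE TABLE ResultTable (
        TestResult VARCHAR(50)   NOT NULL,  -- e.g. 'TestResult1'
        Result     DECIMAL(18,4) NOT NULL,
        SpecimenTableId INT NOT NULL
            REFERENCES SpecimenTable (SpecimenTableId)
    );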

I'm dumping the data like this:

    public void DumpQuickLabDump()
    {
        // T-SQL connection string
        string connection = "Data Source=gaia;Initial Catalog=SalesDWH;Integrated Security=True";

        // Get the data into the DataTable
        //dtData = GetData(...);

        // Create the SqlBulkCopy and ensure it is disposed when done
        using (SqlBulkCopy objSBC = new SqlBulkCopy(connection))
        {
            objSBC.BulkCopyTimeout = 0;
            objSBC.BatchSize = 10000;
            // Specify the destination table
            objSBC.DestinationTableName = "SpecimenTable";
            // Write the data to SQL Server
            objSBC.WriteToServer(QuickLabDump);
        }
    }
    public void DumpTestResults()
    {
        // T-SQL connection string
        string connection = "Data Source=gaia;Initial Catalog=SalesDWH;Integrated Security=True";

        // Get the data into the DataTable
        //dtData = GetData(...);

        // Create the SqlBulkCopy and ensure it is disposed when done
        using (SqlBulkCopy objSBC = new SqlBulkCopy(connection))
        {
            objSBC.BulkCopyTimeout = 0;
            objSBC.BatchSize = 10000;
            // Specify the destination table
            objSBC.DestinationTableName = "ResultTable";
            // Write the data to SQL Server
            objSBC.WriteToServer(TestResults);
        }
    }

Sometimes the client will submit a CSV to me for uploading, and then several days later they will import another CSV, but it will contain a percentage of the SAME records.

How do I avoid duplicating the data? (Please keep in mind that both tables get populated in the database from the one CSV file.)

The solution can be .NET or SQL.

Thank you so much.


Comments (2)

梦毁影碎の 2024-12-30 18:43:03


You can't do what you want using SqlBulkCopy directly. You can, however, bulk copy the rows into a working table and then use a MERGE statement to either update or insert.

This does require, however, that your source data contains enough information to uniquely identify each row.

Let's assume, for example, that instead of SpecimenTable, objSBC.DestinationTableName was set to StagingSpecimenTable, with StagingSpecimenTable being a copy of the SpecimenTable structure. Then, after the bulk copy, you could execute this statement using a SqlCommand:

MERGE SpecimenTable AS target
USING (SELECT name, specid FROM StagingSpecimenTable)
     AS source (name, specid)
ON ( target.specid = source.specid )
WHEN MATCHED
    THEN UPDATE SET target.name = source.name
WHEN NOT MATCHED
    THEN INSERT (name, specid)
    VALUES (source.name, source.specid);

You'd then have to delete from or truncate StagingSpecimenTable, and perform a similar operation for ResultTable.
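The similar operation for ResultTable could look roughly like this sketch - the staging table name and the dedupe key (SPECIMENTABLEID plus testresult) are my assumptions, inferred from the layout shown in the question:

    -- Hypothetical: bulk copy into StagingResultTable first, then merge.
    MERGE ResultTable AS target
    USING (SELECT testresult, result, SPECIMENTABLEID FROM StagingResultTable)
         AS source (testresult, result, SPECIMENTABLEID)
    ON ( target.SPECIMENTABLEID = source.SPECIMENTABLEID
         AND target.testresult = source.testresult )
    WHEN MATCHED
        THEN UPDATE SET target.result = source.result
    WHEN NOT MATCHED
        THEN INSERT (testresult, result, SPECIMENTABLEID)
        VALUES (source.testresult, source.result, source.SPECIMENTABLEID);

    -- Clear the staging tables ready for the next import.
    TRUNCATE TABLE StagingSpecimenTable;
    TRUNCATE TABLE StagingResultTable;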

疯狂的代价 2024-12-30 18:43:03


You need a data-deduplication mechanism to detect which records were imported before; to do so, you need logic that can identify duplicate records.

What is your logic? For instance, you could make SpecID your main dedupe rule, which means that if an incoming SpecID matches data already in your database, you don't import it; otherwise you do.

Alternatively, you can use a combination of fields for your rule, such as "Name+SpecID", or even all fields together. In such cases I recommend a helper field where you store an MD5 (or any other) hash of all the fields combined in your dedupe rule; then, before inserting, you generate the hash for the new values and check whether it already exists in your table by querying that helper field.

This may be a little bit confusing, but the logic is really simple. Please let me know if you need more help on this :-)
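As a sketch of the helper-field idea (the column names here are my assumptions), SQL Server can compute the hash itself via a persisted computed column using HASHBYTES, and a unique index on it then rejects duplicates at insert time:

    -- Hypothetical: persisted hash of the dedupe fields, enforced as unique.
    ALTER TABLE SpecimenTable
        ADD DedupeHash AS HASHBYTES('MD5', Name + '|' + SpecId) PERSISTED;

    CREATE UNIQUE INDEX UX_SpecimenTable_DedupeHash
        ON SpecimenTable (DedupeHash);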
