导入唯一的数据 - c#/sql server 解决方案
我们定期需要导入一个如下所示的 CSV:
Name,SpecID,TestResult1,TestResult2,TestResult3
Alex,ASD123,3.23,452.2,232
Craig,DFG444,453.56,345.3,23
数据以这种方式存储:
SPECIMENTABLE (name,specid,SPECIMENTABLEID)
Alex,ASD123,1
Craig,DFG444,2
并且
RESULTTABLE (testresult,result,SPECIMENTABLEID)
TestResult1,3.23,1
TestResult2,452.2,1
TestResult3,232,1
TestResult1, 453.56,2
etc
我像这样转储数据:
public void DumpQuickLabDump()
{
// T-SQL Connection
string connection = "Data Source=gaia;Initial Catalog=SalesDWH;Integrated Security=True";
// Get the data into the DataTable
//dtData = GetData(...);
// Create an object of SqlBulkCopy
SqlBulkCopy objSBC = new SqlBulkCopy(connection);
// Specify the destination table
objSBC.BulkCopyTimeout = 0;
objSBC.BatchSize = 10000;
objSBC.DestinationTableName = "SpecimenTable";
// Write the data to the SQL Server
objSBC.WriteToServer(QuickLabDump);
}
public void DumpTestResults()
{
// T-SQL Connection
string connection = "Data Source=gaia;Initial Catalog=SalesDWH;Integrated Security=True";
// Get the data into the DataTable
//dtData = GetData(...);
// Create an object of SqlBulkCopy
SqlBulkCopy objSBC = new SqlBulkCopy(connection);
// Specify the destination table
objSBC.BulkCopyTimeout = 0;
objSBC.BatchSize = 10000;
objSBC.DestinationTableName = "ResultTable";
// Write the data to the SQL Server
objSBC.WriteToServer(TestResults);
}
有时客户会向我提交一个 CSV 进行上传,然后几天后,他们会导入另一个 CSV,但是它将有一定比例的相同记录。
我如何避免重复数据? (请记住,两个表是从一个 CSV 文件填充到数据库中的)
解决方案可以是 .NET 或 sql。
太感谢了
we periodically need to import a CSV that looks like this:
Name,SpecID,TestResult1,TestResult2,TestResult3
Alex,ASD123,3.23,452.2,232
Craig,DFG444,453.56,345.3,23
the data gets stored this way:
SPECIMENTABLE (name,specid,SPECIMENTABLEID)
Alex,ASD123,1
Craig,DFG444,2
and
RESULTTABLE (testresult,result,SPECIMENTABLEID)
TestResult1,3.23,1
TestResult2,452.2,1
TestResult3,232,1
TestResult1, 453.56,2
etc
im dumping the data like this:
public void DumpQuickLabDump()
{
// T-SQL Connection
string connection = "Data Source=gaia;Initial Catalog=SalesDWH;Integrated Security=True";
// Get the data into the DataTable
//dtData = GetData(...);
// Create an object of SqlBulkCopy
SqlBulkCopy objSBC = new SqlBulkCopy(connection);
// Specify the destination table
objSBC.BulkCopyTimeout = 0;
objSBC.BatchSize = 10000;
objSBC.DestinationTableName = "SpecimenTable";
// Write the data to the SQL Server
objSBC.WriteToServer(QuickLabDump);
}
public void DumpTestResults()
{
// T-SQL Connection
string connection = "Data Source=gaia;Initial Catalog=SalesDWH;Integrated Security=True";
// Get the data into the DataTable
//dtData = GetData(...);
// Create an object of SqlBulkCopy
SqlBulkCopy objSBC = new SqlBulkCopy(connection);
// Specify the destination table
objSBC.BulkCopyTimeout = 0;
objSBC.BatchSize = 10000;
objSBC.DestinationTableName = "ResultTable";
// Write the data to the SQL Server
objSBC.WriteToServer(TestResults);
}
sometimes the client will submit a CSV to me for uploading and then several days later, they will import another CSV but it will have a percentage of the SAME RECORDS.
how do i avoid duplicating the data? (please keep in mind that two tables are getting populated in the database from the one CSV file)
the solution can be .NET or sql.
thank you so much
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
您无法直接使用
SqlBulkCopy
执行您想要的操作。但是,您可以将行批量复制到工作表中,然后使用MERGE
语句来更新或插入。然而,这确实要求您的源信息有足够的信息来唯一标识每一行。
例如,我们假设
SpecimenTable
是objSBC.DestinationTableName
,它被设置为StagingSpecimenTable
。暂存样本表是样本表结构的副本。然后,在批量复制之后,您可以使用 SqlCommand 执行此语句,然后您必须删除或截断 StagingSpecimenTable 以及 ResultTable 的类似操作
You can't do what you want using
SqlBulkCopy
directly. You can however bulk copy the rows into a working table and then use aMERGE
statement to either update or insert.This does require however that your source information has enough information to uniquely identify each row.
Lets assume for example that instead of
SpecimenTable
was theobjSBC.DestinationTableName
it was set toStagingSpecimenTable
. StagingSpecimenTablebeing a copy of the SpecimenTable struture. Then after the bulk copy you could execute this statement using a SqlCommandYou'd then have to Delete or truncate StagingSpecimenTable as well a similar operation for ResultTable
您需要重复数据删除机制来检测之前导入了哪些记录,为此您需要有一个能够找到重复数据删除记录的逻辑。
你的逻辑是什么?例如,您可以将 SpecID 设置为主要 Dedupe 规则,这意味着如果您的 SpecID 与数据库中的数据相同,则不要导入它,否则导入它。
或者,您可以将规则的字段组合起来,例如“名称+SpecID”,甚至可以收集所有字段。在这种情况下,我建议使用一个辅助字段,您可以在其中存储 MD5(或任何其他哈希机制)来存储组合重复数据删除规则中所有字段的哈希值,然后在插入之前,您需要为新值生成哈希并检查它是否已经存在通过查询辅助字段存在于您的表中。
这可能有点令人困惑,但逻辑非常简单。如果您需要更多帮助,请告诉我:-)
You need Data deduplication mechanism to detect which records was imported before, to do so you need to have a logic to be able to find dedupe records.
What is your logic? For instance you can set SpecID as your main Dedupe Rule, which means if your SpecID was same as data that you have in your database, then don't import it otherwise import it.
Or you can have combination of fields for your rule like "Name+SpecID" or even all fields to gather. In such cases I recommend using a helper field where you can store MD5 (Or any other hash mechanism) to store hash value of combining all your fields in your dedupe rule then before insert you need to generate hash for new values and check if it's already exist in your table by querying on your helper field.
This may be a little bit confusing but logic is really simple. Please let me know if you need more help on this :-)