Remove duplicate records from a large CSV file in C# .NET
I have created a solution that reads a large CSV file, currently 20-30 MB in size. I tried to delete the duplicate rows, based on certain column values that the user chooses at run time, using the usual technique for finding duplicate rows, but it is so slow that the program seems not to be working at all.
What other technique can be applied to remove duplicate records from a CSV file?
Here's the code; I am definitely doing something wrong:
    DataTable dtCSV = ReadCsv(file, columns);
    // columns is a list of strings (List<string>)
    DataTable dt = RemoveDuplicateRecords(dtCSV, columns);

    private DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
    {
        DataView dv = dtCSV.DefaultView;
        string RowFilter = string.Empty;

        if (dt == null)
            dt = dv.ToTable().Clone();

        DataRow row = dtCSV.Rows[0];

        foreach (DataRow row in dtCSV.Rows)
        {
            try
            {
                RowFilter = string.Empty;
                foreach (string column in columns)
                {
                    string col = column;
                    RowFilter += "[" + col + "]" + "='" + row[col].ToString().Replace("'", "''") + "' and ";
                }
                RowFilter = RowFilter.Substring(0, RowFilter.Length - 4);
                dv.RowFilter = RowFilter;
                DataRow dr = dt.NewRow();
                bool result = RowExists(dt, RowFilter);
                if (!result)
                {
                    dr.ItemArray = dv.ToTable().Rows[0].ItemArray;
                    dt.Rows.Add(dr);
                }
            }
            catch (Exception ex)
            {
            }
        }
        return dt;
    }
Comments (5)
One way to do this would be to go through the table, building a `HashSet<string>` that contains the combined column values you're interested in. If you try to add a string that's already there, then you have a duplicate row. That should be very fast.
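A minimal sketch of this idea, assuming the data is already in a `DataTable` as in the question. The names `HashSetDedup` and `RemoveDuplicates` are made up for illustration, and the key separator is assumed not to occur in the data:

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class HashSetDedup
{
    // Returns a new table that keeps only the first row seen for each
    // distinct combination of values in the given key columns.
    public static DataTable RemoveDuplicates(DataTable source, IReadOnlyList<string> keyColumns)
    {
        DataTable result = source.Clone();   // same schema, no rows
        var seen = new HashSet<string>();

        foreach (DataRow row in source.Rows)
        {
            // Combine the key values into one string. The separator (U+001F)
            // is an assumption: it must not appear in the data itself.
            string key = string.Join("\u001F",
                keyColumns.Select(c => Convert.ToString(row[c])));

            // HashSet<T>.Add returns false when the value is already present,
            // so a false result means this row is a duplicate.
            if (seen.Add(key))
            {
                result.ImportRow(row);
            }
        }

        return result;
    }
}
```

Each row is examined once and each lookup is a hash probe, so the work grows roughly linearly with the number of rows instead of re-filtering the whole table for every row.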
If you've implemented your sorting routine as a couple of nested `for` or `foreach` loops, you could optimise it by sorting the data by the columns you wish to de-duplicate against and then simply comparing each row to the last row you looked at.
Posting some code is a sure-fire way to get better answers, though; without an idea of how you've implemented it, anything you get will just be conjecture.
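The sort-then-compare idea could be sketched roughly as follows, assuming each row has already been split into a `string[]` and the key columns are given by index. `SortedDedup` and its parameters are hypothetical names, not taken from the question:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SortedDedup
{
    public static List<string[]> RemoveDuplicates(
        IEnumerable<string[]> rows, IReadOnlyList<int> keyIndexes)
    {
        // Compare two rows on the key columns only (ordinal comparison).
        int CompareKeys(string[] a, string[] b)
        {
            foreach (int i in keyIndexes)
            {
                int c = string.CompareOrdinal(a[i], b[i]);
                if (c != 0) return c;
            }
            return 0;
        }

        // OrderBy performs a stable sort, so among rows with equal keys
        // the one that came first in the input stays first.
        List<string[]> sorted = rows
            .OrderBy(r => r, Comparer<string[]>.Create(CompareKeys))
            .ToList();

        var result = new List<string[]>();
        string[] previous = null;
        foreach (string[] row in sorted)
        {
            // After sorting, duplicates are adjacent, so one comparison
            // against the previous row is enough.
            if (previous == null || CompareKeys(previous, row) != 0)
            {
                result.Add(row);
            }
            previous = row;
        }
        return result;
    }
}
```

Note that the result comes back in sorted order rather than in the original file order.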
Have you tried wrapping the rows in a class and using LINQ?
LINQ will give you options to get distinct values, etc.
You're currently creating a string-defined filter condition for each and every row and then running that against the entire table - that is going to be slow.
It is much better to take a LINQ to Objects approach, where you read each row in turn into an instance of a class and then use the LINQ Distinct operator to select only unique objects (non-uniques will be thrown away).
The code would look something like:
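As a minimal sketch only: the `Person` class, its three columns, the header row, and the naive comma splitting are all assumptions made for illustration, since the real file layout is not shown in the question.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical row type. Equality deliberately covers only the columns
// that define a duplicate (FirstName and LastName in this sketch).
public sealed class Person : IEquatable<Person>
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
    public string Email { get; set; }

    public bool Equals(Person other) =>
        other != null && FirstName == other.FirstName && LastName == other.LastName;

    public override bool Equals(object obj) => Equals(obj as Person);

    public override int GetHashCode() => (FirstName, LastName).GetHashCode();
}

public static class LinqDedup
{
    // Naive parsing: splits on commas and does not handle quoted fields.
    public static IEnumerable<Person> Parse(IEnumerable<string> lines) =>
        lines.Select(line => line.Split(','))
             .Select(f => new Person { FirstName = f[0], LastName = f[1], Email = f[2] });

    // File.ReadLines streams the file lazily; Skip(1) assumes a header row.
    public static List<Person> ReadDistinct(string path) =>
        Parse(File.ReadLines(path).Skip(1))
            .Distinct()   // uses Person.Equals and Person.GetHashCode
            .ToList();
}
```

`Distinct` keeps a hash set internally, so each row is checked against previously seen rows through `GetHashCode` and `Equals` rather than by scanning the whole data set.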
If you don't know the fields your CSV file is going to have, then you may have to modify this slightly - possibly using an object which reads the CSV cells into a List or Dictionary for each row.
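For the case where the columns are only known at run time, one possible shape is a dictionary per row plus a comparer that looks only at the chosen key columns. `KeyColumnComparer` and `DynamicCsv` are hypothetical names, and the parsing again assumes plain comma-separated values without quoting:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Compares rows (column name -> value) on key columns chosen at run time.
public sealed class KeyColumnComparer : IEqualityComparer<Dictionary<string, string>>
{
    private readonly string[] _keys;

    public KeyColumnComparer(IEnumerable<string> keys) { _keys = keys.ToArray(); }

    public bool Equals(Dictionary<string, string> x, Dictionary<string, string> y) =>
        _keys.All(k => x[k] == y[k]);

    public int GetHashCode(Dictionary<string, string> row)
    {
        unchecked
        {
            // Combine the hash codes of the key values only.
            int hash = 17;
            foreach (string k in _keys)
            {
                hash = hash * 31 + (row[k]?.GetHashCode() ?? 0);
            }
            return hash;
        }
    }
}

public static class DynamicCsv
{
    // Turns each data line into a dictionary keyed by the header names.
    public static IEnumerable<Dictionary<string, string>> ToRows(
        string headerLine, IEnumerable<string> lines)
    {
        string[] header = headerLine.Split(',');
        foreach (string line in lines)
        {
            string[] cells = line.Split(',');
            var row = new Dictionary<string, string>();
            for (int i = 0; i < header.Length; i++)
            {
                row[header[i]] = cells[i];
            }
            yield return row;
        }
    }
}
```

With these in place, something like `DynamicCsv.ToRows(header, lines).Distinct(new KeyColumnComparer(columns))` would yield one row per distinct key combination.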
For reading objects from file using Linq, this article by someone-or-other might help - http://www.developerfusion.com/article/84468/linq-to-log-files/
Based on the new code you've included in your question, I'll provide this second answer. I still prefer the first answer, but if you have to use `DataTable` and `DataRows`, then this second answer might help:
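A rough sketch of what a `DataTable`-based version might look like, keeping the `RemoveDuplicateRecords` signature from the question. `DataRowKeyComparer` and `DataTableDedup` are illustrative names, and this is one possible approach rather than a reconstruction of any particular implementation:

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Treats two DataRows as equal when they match on the chosen columns.
public sealed class DataRowKeyComparer : IEqualityComparer<DataRow>
{
    private readonly string[] _columns;

    public DataRowKeyComparer(IEnumerable<string> columns) { _columns = columns.ToArray(); }

    public bool Equals(DataRow x, DataRow y) =>
        _columns.All(c => object.Equals(x[c], y[c]));

    public int GetHashCode(DataRow row)
    {
        unchecked
        {
            int hash = 17;
            foreach (string c in _columns)
            {
                // DataRow returns DBNull.Value rather than null for missing values.
                hash = hash * 31 + row[c].GetHashCode();
            }
            return hash;
        }
    }
}

public static class DataTableDedup
{
    public static DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
    {
        DataTable dt = dtCSV.Clone();   // same columns, no rows

        // Distinct hashes each row once instead of filtering the table per row.
        IEnumerable<DataRow> unique = dtCSV.Rows
            .Cast<DataRow>()
            .Distinct(new DataRowKeyComparer(columns));

        foreach (DataRow row in unique)
        {
            dt.ImportRow(row);
        }
        return dt;
    }
}
```

This avoids building a `RowFilter` string and calling `ToTable()` for every row, which is where the original loop spends most of its time.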