Removing duplicate records from a large CSV file (C# .NET)

Posted on 2024-10-21 17:59:00

I have written a solution that reads a large CSV file, currently 20-30 MB in size. I try to delete duplicate rows based on column values that the user chooses at run time, using the usual technique for finding duplicate rows, but it is so slow that the program appears not to be working at all.

What other technique can I use to remove duplicate records from a CSV file?

Here is the code; I am surely doing something wrong:

DataTable dtCSV = ReadCsv(file, columns);
// columns is a List<string> holding the column names chosen by the user
DataTable dt = RemoveDuplicateRecords(dtCSV, columns);

// dt and RowExists are defined elsewhere in the class (not shown)
private DataTable RemoveDuplicateRecords(DataTable dtCSV, List<string> columns)
{
    DataView dv = dtCSV.DefaultView;
    string RowFilter = string.Empty;

    if (dt == null)
        dt = dv.ToTable().Clone();

    foreach (DataRow row in dtCSV.Rows)
    {
        try
        {
            // Build a filter such as "[A]='x' and [B]='y'" for the current row
            RowFilter = string.Empty;
            foreach (string column in columns)
            {
                string col = column;
                RowFilter += "[" + col + "]" + "='" + row[col].ToString().Replace("'", "''") + "' and ";
            }
            RowFilter = RowFilter.Substring(0, RowFilter.Length - 4);

            // The filter is re-evaluated against the whole table for every row
            dv.RowFilter = RowFilter;
            DataRow dr = dt.NewRow();
            bool result = RowExists(dt, RowFilter);
            if (!result)
            {
                dr.ItemArray = dv.ToTable().Rows[0].ItemArray;
                dt.Rows.Add(dr);
            }
        }
        catch (Exception ex)
        {
            // exceptions are silently ignored
        }
    }
    return dt;
}

川水往事 2024-10-28 17:59:00

One way to do this would be to go through the table, building a HashSet<string> that contains the combined column values you're interested in. If you try to add a string that's already there, then you have a duplicate row. Something like:

HashSet<string> ScannedRecords = new HashSet<string>();

// DataRowCollection is non-generic, so the loop variable must be typed as DataRow
foreach (DataRow row in dtCSV.Rows)
{
    // Build a string that contains the combined column values
    StringBuilder sb = new StringBuilder();
    foreach (string col in columns)
    {
        sb.AppendFormat("[{0}={1}]", col, row[col].ToString());
    }

    // Try to add the string to the HashSet.
    // If Add returns false, then there is a prior record with the same values
    if (!ScannedRecords.Add(sb.ToString()))
    {
        // This record is a duplicate.
    }
}

That should be very fast.
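
To turn the duplicate check into a de-duplicated table, the loop can copy each first occurrence into a clone of the source table. The following is only a sketch under assumptions: `HashSetDedup`, `BuildKey` and `RemoveDuplicates` are made-up names, and the length-prefixed key is a swapped-in variation on the `[col=value]` format, chosen so that values which themselves contain `][` cannot produce the same key as a different row.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Text;

// Hypothetical helper, not part of the question's code.
public static class HashSetDedup
{
    // Builds a key from the chosen columns. Each value is prefixed with its
    // length, so the boundaries between values are unambiguous.
    // DBNull is converted to an empty string, so null and "" count as equal.
    public static string BuildKey(DataRow row, IEnumerable<string> columns)
    {
        var sb = new StringBuilder();
        foreach (string col in columns)
        {
            string value = Convert.ToString(row[col]) ?? string.Empty;
            sb.Append(value.Length).Append(':').Append(value);
        }
        return sb.ToString();
    }

    // Returns a new table containing the first row seen for each key.
    public static DataTable RemoveDuplicates(DataTable source, IList<string> columns)
    {
        DataTable result = source.Clone();   // same schema, no rows
        var seen = new HashSet<string>();
        foreach (DataRow row in source.Rows)
        {
            // Add returns true only the first time a key is seen
            if (seen.Add(BuildKey(row, columns)))
            {
                result.ImportRow(row);
            }
        }
        return result;
    }
}
```

Each row is visited once and each lookup is a hash probe, so the work grows roughly linearly with the number of rows instead of re-filtering the whole table per row.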

浅蓝的眸勾画不出的柔情 2024-10-28 17:59:00

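If you have implemented your routine as a couple of nested for or foreach loops, you could optimise it by sorting the data on the columns you want to de-duplicate by and then comparing each row with the previous row you looked at.

Posting some code is a sure way to get better answers, though. Without knowing how you have implemented it, anything you get will just be guesswork.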
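
The sort-then-compare idea can be sketched as follows. This is an illustration only: `SortedDedup`, `RemoveDuplicates` and the representation of rows as string arrays with key column indexes are assumptions, not something taken from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper illustrating sort-based de-duplication.
public static class SortedDedup
{
    // Compares two rows on the chosen column indexes only (ordinal comparison).
    private static int CompareKeys(string[] a, string[] b, int[] keyColumns)
    {
        foreach (int i in keyColumns)
        {
            int c = string.CompareOrdinal(a[i], b[i]);
            if (c != 0) return c;
        }
        return 0;
    }

    public static List<string[]> RemoveDuplicates(IEnumerable<string[]> rows, int[] keyColumns)
    {
        var comparer = Comparer<string[]>.Create((a, b) => CompareKeys(a, b, keyColumns));

        // OrderBy is a stable sort, so equal rows keep their original relative order
        List<string[]> sorted = rows.OrderBy(r => r, comparer).ToList();

        var result = new List<string[]>();
        for (int i = 0; i < sorted.Count; i++)
        {
            // After sorting, duplicates are adjacent: compare with the previous row only
            if (i == 0 || CompareKeys(sorted[i - 1], sorted[i], keyColumns) != 0)
            {
                result.Add(sorted[i]);
            }
        }
        return result;
    }
}
```

The sort costs O(n log n) and the scan is a single pass, which replaces the row-against-every-row comparison of nested loops. Note that the result comes back in sorted order rather than in the file's original order.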

违心° 2024-10-28 17:59:00

Have you tried wrapping the rows in a class and using Linq?

Linq gives you options for getting distinct values and so on.
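
As a rough sketch of what that could look like, assuming the columns are known in advance: `CsvRecord` and its `Name`, `Email` and `City` properties are invented for illustration, and grouping on an anonymous type is just one of several ways to get distinct values with Linq.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row type; the property names are made up for illustration.
public sealed class CsvRecord
{
    public string Name { get; set; } = "";
    public string Email { get; set; } = "";
    public string City { get; set; } = "";
}

public static class LinqDedup
{
    // Keeps the first record for each (Name, Email) pair.
    public static List<CsvRecord> DistinctByNameAndEmail(IEnumerable<CsvRecord> records)
    {
        return records
            // Anonymous types compare by value, so they work as composite keys
            .GroupBy(r => new { r.Name, r.Email })
            .Select(g => g.First())
            .ToList();
    }
}
```

Grouping rather than calling `Distinct()` directly avoids having to override `Equals` and `GetHashCode` on the class, and lets the key cover only some of the properties.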

内心旳酸楚 2024-10-28 17:59:00

You're currently creating a string-defined filter condition for each and every row and then running that against the entire table - that is going to be slow.

Much better to take a Linq2Objects approach where you read each row in turn into an instance of a class and then use the Linq Distinct operator to select only unique objects (non-uniques will be thrown away).

The code would look something like:

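// Distinct() only removes duplicates if the row class overrides Equals and GetHashCode
var uniqueRows = (from row in inputCSV.rows
                  select row).Distinct();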

If you don't know which fields your CSV file is going to have, then you may have to modify this slightly, possibly using an object which reads the CSV cells into a List or Dictionary for each row.

For reading objects from file using Linq, this article by someone-or-other might help - http://www.developerfusion.com/article/84468/linq-to-log-files/
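
For the case where the fields are not known in advance, a minimal sketch might read each line lazily into a dictionary and group on whichever columns the user picked. Everything here is an assumption made for illustration: the names `CsvRows`, `Read` and `DistinctOn`, the naive comma split (no quoted fields or embedded newlines), and the use of the unit-separator character to join key values (which assumes it never occurs in the data).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper; assumes a header line and no quoted commas in fields.
public static class CsvRows
{
    // Lazily yields one dictionary (header -> cell) per data line.
    public static IEnumerable<Dictionary<string, string>> Read(TextReader reader)
    {
        var headerLine = reader.ReadLine();
        if (headerLine == null) yield break;
        string[] headers = headerLine.Split(',');

        for (var line = reader.ReadLine(); line != null; line = reader.ReadLine())
        {
            string[] cells = line.Split(',');
            var row = new Dictionary<string, string>();
            for (int i = 0; i < headers.Length; i++)
            {
                // Short lines are padded with empty cells
                row[headers[i]] = i < cells.Length ? cells[i] : string.Empty;
            }
            yield return row;
        }
    }

    // Keeps the first row for each combination of values in the chosen columns.
    public static List<Dictionary<string, string>> DistinctOn(
        IEnumerable<Dictionary<string, string>> rows, IList<string> columns)
    {
        return rows
            .GroupBy(row => string.Join("\u001F", columns.Select(c => row[c])))
            .Select(group => group.First())
            .ToList();
    }
}
```

With a real file, `Read` could be given a `StreamReader` so that rows are parsed one at a time as the query consumes them.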

茶花眉 2024-10-28 17:59:00

Based on the new code you've included in your question, I'll provide this second answer - I still prefer the first answer, but if you have to use DataTable and DataRows, then this second answer might help:

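using System.Collections.Generic;
using System.Data;
using System.Linq;

class DataRowEqualityComparer : IEqualityComparer<DataRow>
{
    public bool Equals(DataRow x, DataRow y)
    {
        // Cell-by-cell comparison; assumes both rows share the same schema
        return x.ItemArray.SequenceEqual(y.ItemArray);
    }

    public int GetHashCode(DataRow obj)
    {
        // Derive the hash from the cell values so that equal rows hash alike
        // (base.GetHashCode() would hash the comparer, not the row)
        int hash = 17;
        foreach (object cell in obj.ItemArray)
        {
            hash = unchecked(hash * 31 + (cell == null ? 0 : cell.GetHashCode()));
        }
        return hash;
    }
}

// ...

var comparer = new DataRowEqualityComparer();
// DataRowCollection is non-generic, so cast before using Linq operators
var filteredRows = dtCSV.Rows.Cast<DataRow>().Distinct(comparer);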
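
Since the question asks for duplicates to be judged on columns chosen at run time, the comparer could also be limited to those columns. The sketch below is one possible shape for that, not the answer's own code: `ColumnSubsetComparer`, `DataTableDedup` and `DistinctRows` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

// Hypothetical comparer: two rows are equal when the chosen columns match.
public sealed class ColumnSubsetComparer : IEqualityComparer<DataRow>
{
    private readonly string[] _columns;

    public ColumnSubsetComparer(IEnumerable<string> columns)
    {
        _columns = columns.ToArray();
    }

    public bool Equals(DataRow x, DataRow y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        return _columns.All(c => object.Equals(x[c], y[c]));
    }

    public int GetHashCode(DataRow row)
    {
        // Combine the hashes of the chosen cells only (null cells are DBNull.Value)
        int hash = 17;
        foreach (string c in _columns)
        {
            hash = unchecked(hash * 31 + row[c].GetHashCode());
        }
        return hash;
    }
}

public static class DataTableDedup
{
    // Copies one row per distinct key into a new table with the same schema.
    public static DataTable DistinctRows(DataTable source, IEnumerable<string> columns)
    {
        DataTable result = source.Clone();
        var comparer = new ColumnSubsetComparer(columns);
        foreach (DataRow row in source.Rows.Cast<DataRow>().Distinct(comparer))
        {
            result.ImportRow(row);
        }
        return result;
    }
}
```

Because `Distinct` uses the hash code to bucket rows before calling `Equals`, the quality of `GetHashCode` largely determines how fast this runs on a large table.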
