Comparing all rows in a DataTable - identifying duplicate records

Posted on 2024-07-14 21:50:20

I would like to normalize data in a DataTable insertRows without a key. To do that I need to identify and mark duplicate records by finding their ID (import_id). Afterwards I will select only the distinct ones. The approach I am thinking of is to compare each row against all rows in that DataTable insertRows.

The columns in the DataTable are not known at design time, and there is no key. Performance-wise, the table would have as many as 10k to 20k records and about 40 columns.

How do I accomplish this without sacrificing performance too much?

I attempted this with LINQ, but I did not know how to specify the where criteria dynamically.
Here I am comparing first and last names in a loop for each row:

foreach (System.Data.DataRow lrows in importDataTable.Rows)
{
    IEnumerable<System.Data.DataRow> insertRows = importDataTable.Rows.Cast<System.Data.DataRow>();

    var col_matches =
    from irows in insertRows
    where
    String.Compare(irows["fname"].ToString(), lrows["fname"].ToString(), true).Equals(0)
    &&
    String.Compare(irows["last_name"].ToString(), lrows["last_name"].ToString(),true).Equals(0)

    select new { import_id = irows["import_id"].ToString() };
}

Any ideas are welcome.
My similar question: How do I find similar column names using LINQ?

手长情犹 2024-07-21 21:50:20

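The simplest way to do this without O(n²) complexity is to use a data structure that implements set operations efficiently, specifically Contains. Fortunately, .NET (as of 3.5) includes the HashSet<T> class, which does exactly this for you. To use it, you will need a single object that encapsulates a row of your DataTable.

If DataRow does not work for this, I suggest converting the relevant fields to strings, concatenating them, and putting the result into the HashSet. Before inserting a row, check whether the HashSet already contains it (using Contains). If it does, you have found a duplicate.

Edit:

This approach is O(n) in the number of rows, since each HashSet lookup takes constant time on average.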
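
A minimal sketch of the HashSet approach described above, assuming the columns to compare are supplied at run time as a list of names. The names `DuplicateFinder` and `FindDuplicateIds`, the U+001F separator, and the ordinal case-insensitive comparison are illustrative assumptions, not part of the original answer; `import_id`, `fname`, and `last_name` come from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Linq;

public static class DuplicateFinder
{
    // Returns the ID of every row whose key columns repeat an earlier row.
    // keyColumns is chosen at run time, so no column names are hard-coded.
    public static List<string> FindDuplicateIds(
        DataTable table, IEnumerable<string> keyColumns, string idColumn = "import_id")
    {
        string[] columns = keyColumns.ToArray();

        // Case-insensitive set, roughly matching String.Compare(..., true) in the question.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var duplicateIds = new List<string>();

        foreach (DataRow row in table.Rows)
        {
            // Join the key values with a separator that is assumed never to
            // occur inside a field (U+001F, the ASCII unit separator).
            string key = string.Join(
                "\u001F", columns.Select(c => Convert.ToString(row[c])));

            // HashSet.Add returns false when the key is already present.
            if (!seen.Add(key))
            {
                duplicateIds.Add(Convert.ToString(row[idColumn]));
            }
        }

        return duplicateIds;
    }
}
```

With the question's table this could be called as `DuplicateFinder.FindDuplicateIds(importDataTable, new[] { "fname", "last_name" })`. Each row is visited once, so the cost grows linearly with the row count rather than quadratically.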

旧时光的容颜 2024-07-21 21:50:20

I am not sure if I understand the question correctly, but when dealing with System.Data.DataTable the following should work.

for (Int32 r0 = 0; r0 < dataTable.Rows.Count; r0++)
{
   // Compare row r0 only against later rows, so each pair is checked once.
   for (Int32 r1 = r0 + 1; r1 < dataTable.Rows.Count; r1++)
   {
      Boolean rowsEqual = true;

      for (Int32 c = 0; c < dataTable.Columns.Count; c++)
      {
         // Stop at the first column whose values differ.
         if (!Object.Equals(dataTable.Rows[r0][c], dataTable.Rows[r1][c]))
         {
            rowsEqual = false;
            break;
         }
      }

      if (rowsEqual)
      {
         Console.WriteLine(
            String.Format("Row {0} is a duplicate of row {1}.", r0, r1));
      }
   }
}
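
The same all-columns comparison can also be done in a single pass by combining it with a hash set. The sketch below is a variation, not the answer's own method: it assumes `DataRowComparer.Default` is available (on .NET Framework this needs a reference to System.Data.DataSetExtensions), and the names `FullRowDuplicates` and `FindDuplicateRowIndexes` are hypothetical. As in the loop above, rows that differ in any column, including a unique ID column, are not reported as duplicates.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class FullRowDuplicates
{
    // Returns the zero-based indexes of rows that equal an earlier row in every column.
    public static List<int> FindDuplicateRowIndexes(DataTable table)
    {
        // DataRowComparer.Default compares two rows value by value across all columns.
        var seen = new HashSet<DataRow>(DataRowComparer.Default);
        var duplicates = new List<int>();

        for (int i = 0; i < table.Rows.Count; i++)
        {
            // Add returns false when an equal row has already been seen.
            if (!seen.Add(table.Rows[i]))
            {
                duplicates.Add(i);
            }
        }

        return duplicates;
    }
}
```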
