检测具有相同子项的实体

发布于 2024-12-10 12:57:31 字数 1332 浏览 0 评论 0原文

我有两个实体,ClassStudent,以多对多关系链接。

不幸的是,当从外部应用程序导入数据时,会重复创建一些类。 “重复”班级名称不同,但科目相同,学生也相同。

例如:

{ Id = 341,标题 = '10rs/PE1a',SubjectId = 60,学生 = { Jack, Bill, Sarah } }

{ Id = 429,标题 = '10rs/PE1b',SubjectId = 60,学生 = { Jack, Bill, Sarah } }

没有匹配这些重复类的名称的通用规则,因此识别两个类重复的唯一方法是它们具有相同的名称主题 ID学生

我想使用 LINQ 来检测所有重复项(并最终合并它们)。到目前为止我已经尝试过:

var sb = new StringBuilder();
using (var ctx = new Ctx()) {
  ctx.CommandTimeout = 10000; // Because the next line takes so long!
  var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id);
  foreach (var c in allClasses) {
    var duplicates = allClasses.Where(o => o.SubjectId == c.SubjectId && o.Id != c.Id && o.Students.Equals(c.Students));
    foreach (var d in duplicates)
      sb.Append(d.LongName).Append(" is a duplicate of ").Append(c.LongName).Append("<br />");
  }
}
lblResult.Text = sb.ToString();

这不好,因为我收到错误:

NotSupportedException:无法创建“TeachEDM.Student”类型的常量值。此上下文仅支持原始类型(“例如 Int32、String 和 Guid”)。

显然它不喜欢我尝试在 LINQ 中匹配 o.SubjectId == c.SubjectId

而且,这似乎是一种可怕的方法,而且速度非常慢。调用数据库需要5分钟以上。

我真的很感激一些建议。

I have two entities, Class and Student, linked in a many-to-many relationship.

When data is imported from an external application, unfortunately some classes are created in duplicate. The 'duplicate' classes have different names, but the same subject and the same students.

For example:

{ Id = 341, Title = '10rs/PE1a', SubjectId = 60, Students = { Jack, Bill, Sarah } }

{ Id = 429, Title = '10rs/PE1b', SubjectId = 60, Students = { Jack, Bill, Sarah } }

There is no general rule for matching the names of these duplicate classes, so the only way to identify that two classes are duplicates is that they have the same SubjectId and Students.

I'd like to use LINQ to detect all duplicates (and ultimately merge them). So far I have tried:

var sb = new StringBuilder();
using (var ctx = new Ctx()) {
  ctx.CommandTimeout = 10000; // Because the next line takes so long!
  var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id);
  foreach (var c in allClasses) {
    var duplicates = allClasses.Where(o => o.SubjectId == c.SubjectId && o.Id != c.Id && o.Students.Equals(c.Students));
    foreach (var d in duplicates)
      sb.Append(d.LongName).Append(" is a duplicate of ").Append(c.LongName).Append("<br />");
  }
}
lblResult.Text = sb.ToString();

This is no good because I get the error:

NotSupportedException: Unable to create a constant value of type 'TeachEDM.Student'. Only primitive types ('such as Int32, String, and Guid') are supported in this context.

Evidently it doesn't like me trying to match o.SubjectId == c.SubjectId in LINQ.

Also, this seems a horrible method in general and is very slow. The call to the database takes more than 5 minutes.

I'd really appreciate some advice.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(1

罗罗贝儿 2024-12-17 12:57:31

SubjectId 的比较不是问题,因为 c.SubjectId 是原始类型的值(我猜是 int)。该异常抱怨 Equals(c.Students)c.Students 是一个常量(相对于查询duplicates),但不是原始类型。

我还会尝试在内存中而不是在数据库中进行比较。无论如何,当您启动第一个 foreach 循环时,您都会将整个数据加载到内存中:它执行查询 allClasses。然后在循环内部将 IQueryable allClasses 扩展为 IQueryable duplicates ,然后在内部 foreach 循环中执行它。这是外循环的每个元素一个数据库查询!这可以解释代码性能不佳的原因。

所以我会尝试在内存中执行第一个foreach的内容。为了比较 Students 列表,有必要逐个元素进行比较,而不是对 Students 集合的引用,因为它们肯定是不同的。

var sb = new StringBuilder();
using (var ctx = new Ctx())
{
    ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
    var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id)
        .ToList(); // executes query, allClasses is now a List, not an IQueryable

    // everything from here runs in memory
    foreach (var c in allClasses)
    {
        var duplicates = allClasses.Where(
           o => o.SubjectId == c.SubjectId &&
           o.Id != c.Id &&
           o.Students.OrderBy(s => s.Name).Select(s => s.Name)
            .SequenceEqual(c.Students.OrderBy(s => s.Name).Select(s => s.Name)));

        // duplicates is an IEnumerable, not an IQueryable
        foreach (var d in duplicates)
            sb.Append(d.LongName)
              .Append(" is a duplicate of ")
              .Append(c.LongName)
              .Append("<br />");
    }
}
lblResult.Text = sb.ToString();

按名称对序列进行排序是必要的,因为我相信 SequenceEqual 会比较序列的长度,然后比较元素 0 与元素 0,然后比较元素 1 与元素 1,依此类推。


编辑对于您的评论,第一个查询仍然很慢。

如果您有 1300 个班级,每个班级有 30 名学生,则急切加载 (Include) 的性能可能会因数据库和客户端之间传输的数据倍增而受到影响。此处对此进行了解释:我可以在 EntityFramework 中的 ObjectSet 上使用多少个 Include 来保持性能? .查询很复杂,因为它需要班级和学生之间的 JOIN,并且对象物化也很复杂,因为 EF 必须在创建对象时过滤掉重复的数据。

另一种方法是在第一个查询中仅加载没有学生的课程,然后在循环内显式地逐个加载学生。它看起来像这样:

var sb = new StringBuilder();
using (var ctx = new Ctx())
{
    ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
    var allClasses = ctx.Classes.OrderBy(o => o.Id).ToList(); // <- No Include!
    foreach (var c in allClasses)
    {
        // "Explicite loading": This is a new roundtrip to the DB
        ctx.LoadProperty(c, "Students");
    }

    foreach (var c in allClasses)
    {
        // ... same code as above
    }
}
lblResult.Text = sb.ToString();

在此示例中,您将有 1 + 1300 个数据库查询,而不是只有一个,但不会出现急切加载时发生的数据乘法,并且查询更简单(没有 JOIN 班级和学生之间)。

显式加载解释如下:

如果您使用延迟加载,则不需要使用 LoadProperty 来加载第一个 foreach,因为 Students 集合将在您第一次访问时加载。它应该会导致同样的 1300 个额外查询,例如显式加载。

The comparison of the SubjectId is not the problem because c.SubjectId is a value of a primitive type (int, I guess). The exception complains about Equals(c.Students). c.Students is a constant (with respect to the query duplicates) but not a primitive type.

I would also try to do the comparison in memory and not in the database. You are loading the whole data into memory anyway when you start your first foreach loop: It executes the query allClasses. Then inside of the loop you extend the IQueryable allClasses to the IQueryable duplicates which gets executed then in the inner foreach loop. This is one database query per element of your outer loop! This could explain the poor performance of the code.

So I would try to perform the content of the first foreach in memory. For the comparison of the Students list it is necessary to compare element by element, not the references to the Students collections because they are for sure different.

var sb = new StringBuilder();
using (var ctx = new Ctx())
{
    ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
    var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id)
        .ToList(); // executes query, allClasses is now a List, not an IQueryable

    // everything from here runs in memory
    foreach (var c in allClasses)
    {
        var duplicates = allClasses.Where(
           o => o.SubjectId == c.SubjectId &&
           o.Id != c.Id &&
           o.Students.OrderBy(s => s.Name).Select(s => s.Name)
            .SequenceEqual(c.Students.OrderBy(s => s.Name).Select(s => s.Name)));

        // duplicates is an IEnumerable, not an IQueryable
        foreach (var d in duplicates)
            sb.Append(d.LongName)
              .Append(" is a duplicate of ")
              .Append(c.LongName)
              .Append("<br />");
    }
}
lblResult.Text = sb.ToString();

Ordering the sequences by name is necessary because, I believe, SequenceEqual compares length of the sequence and then element 0 with element 0, then element 1 with element 1 and so on.


Edit To your comment that the first query is still slow.

If you have 1300 classes with 30 students each the performance of eager loading (Include) could suffer from the multiplication of data which are transfered between database and client. This is explained here: How many Include I can use on ObjectSet in EntityFramework to retain performance? . The query is complex because it needs a JOIN between classes and students and object materialization is complex as well because EF must filter out the duplicated data when the objects are created.

An alternative approach is to load only the classes without the students in the first query and then load the students one by one inside of a loop explicitely. It would look like this:

var sb = new StringBuilder();
using (var ctx = new Ctx())
{
    ctx.CommandTimeout = 10000; // Perhaps not necessary anymore
    var allClasses = ctx.Classes.OrderBy(o => o.Id).ToList(); // <- No Include!
    foreach (var c in allClasses)
    {
        // "Explicite loading": This is a new roundtrip to the DB
        ctx.LoadProperty(c, "Students");
    }

    foreach (var c in allClasses)
    {
        // ... same code as above
    }
}
lblResult.Text = sb.ToString();

You would have 1 + 1300 database queries in this example instead of only one, but you won't have the data multiplication which occurs with eager loading and the queries are simpler (no JOIN between classes and students).

Explicite loading is explained here:

If you work with Lazy Loading the first foreach with LoadProperty would not be necessary as the Students collections will be loaded the first time you access it. It should result in the same 1300 additional queries like explicite loading.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文