检测具有相同子项的实体
我有两个实体,Class
和 Student
,以多对多关系链接。
不幸的是,当从外部应用程序导入数据时,会重复创建一些类。 “重复”班级名称不同,但科目相同,学生也相同。
例如:
{ Id = 341,标题 = '10rs/PE1a',SubjectId = 60,学生 = { Jack, Bill, Sarah } }
{ Id = 429,标题 = '10rs/PE1b',SubjectId = 60,学生 = { Jack, Bill, Sarah } }
没有匹配这些重复类的名称的通用规则,因此识别两个类重复的唯一方法是它们具有相同的名称主题 ID 和学生。
我想使用 LINQ 来检测所有重复项(并最终合并它们)。到目前为止我已经尝试过:
var sb = new StringBuilder();
using (var ctx = new Ctx()) {
ctx.CommandTimeout = 10000; // Because the next line takes so long!
var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id);
foreach (var c in allClasses) {
var duplicates = allClasses.Where(o => o.SubjectId == c.SubjectId && o.Id != c.Id && o.Students.Equals(c.Students));
foreach (var d in duplicates)
sb.Append(d.LongName).Append(" is a duplicate of ").Append(c.LongName).Append("<br />");
}
}
lblResult.Text = sb.ToString();
这不好,因为我收到错误:
NotSupportedException:无法创建“TeachEDM.Student”类型的常量值。此上下文仅支持原始类型(“例如 Int32、String 和 Guid”)。
显然它不喜欢我尝试在 LINQ 中匹配 o.SubjectId == c.SubjectId
。
而且,这似乎是一种可怕的方法,而且速度非常慢。调用数据库需要5分钟以上。
我真的很感激一些建议。
I have two entities, Class
and Student
, linked in a many-to-many relationship.
When data is imported from an external application, unfortunately some classes are created in duplicate. The 'duplicate' classes have different names, but the same subject and the same students.
For example:
{ Id = 341, Title = '10rs/PE1a', SubjectId = 60, Students = { Jack, Bill, Sarah } }
{ Id = 429, Title = '10rs/PE1b', SubjectId = 60, Students = { Jack, Bill, Sarah } }
There is no general rule for matching the names of these duplicate classes, so the only way to identify that two classes are duplicates is that they have the same SubjectId and Students.
I'd like to use LINQ to detect all duplicates (and ultimately merge them). So far I have tried:
var sb = new StringBuilder();
using (var ctx = new Ctx()) {
ctx.CommandTimeout = 10000; // Because the next line takes so long!
var allClasses = ctx.Classes.Include("Students").OrderBy(o => o.Id);
foreach (var c in allClasses) {
var duplicates = allClasses.Where(o => o.SubjectId == c.SubjectId && o.Id != c.Id && o.Students.Equals(c.Students));
foreach (var d in duplicates)
sb.Append(d.LongName).Append(" is a duplicate of ").Append(c.LongName).Append("<br />");
}
}
lblResult.Text = sb.ToString();
This is no good because I get the error:
NotSupportedException: Unable to create a constant value of type 'TeachEDM.Student'. Only primitive types ('such as Int32, String, and Guid') are supported in this context.
Evidently it doesn't like me trying to match o.SubjectId == c.SubjectId
in LINQ.
Also, this seems a horrible method in general and is very slow. The call to the database takes more than 5 minutes.
I'd really appreciate some advice.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
SubjectId
的比较不是问题,因为c.SubjectId
是原始类型的值(我猜是int
)。该异常抱怨Equals(c.Students)
。c.Students
是一个常量(相对于查询duplicates
),但不是原始类型。我还会尝试在内存中而不是在数据库中进行比较。无论如何,当您启动第一个
foreach
循环时,您都会将整个数据加载到内存中:它执行查询allClasses
。然后在循环内部将 IQueryable allClasses 扩展为 IQueryable duplicates ,然后在内部 foreach 循环中执行它。这是外循环的每个元素一个数据库查询!这可以解释代码性能不佳的原因。所以我会尝试在内存中执行第一个
foreach
的内容。为了比较Students
列表,有必要逐个元素进行比较,而不是对 Students 集合的引用,因为它们肯定是不同的。按名称对序列进行排序是必要的,因为我相信
SequenceEqual
会比较序列的长度,然后比较元素 0 与元素 0,然后比较元素 1 与元素 1,依此类推。编辑对于您的评论,第一个查询仍然很慢。
如果您有 1300 个班级,每个班级有 30 名学生,则急切加载 (
Include
) 的性能可能会因数据库和客户端之间传输的数据倍增而受到影响。此处对此进行了解释:我可以在 EntityFramework 中的 ObjectSet 上使用多少个 Include 来保持性能? .查询很复杂,因为它需要班级和学生之间的JOIN
,并且对象物化也很复杂,因为 EF 必须在创建对象时过滤掉重复的数据。另一种方法是在第一个查询中仅加载没有学生的课程,然后在循环内显式地逐个加载学生。它看起来像这样:
在此示例中,您将有 1 + 1300 个数据库查询,而不是只有一个,但不会出现急切加载时发生的数据乘法,并且查询更简单(没有
JOIN
班级和学生之间)。显式加载解释如下:
EntityObject
派生实体):http://msdn.microsoft.com/en-us/library/dd456855.aspx< /a>EntityObject
派生实体,您还可以使用EntityCollection
的Load
方法:http://msdn.microsoft.com/en-us/library/bb896370.aspx< /a>如果您使用延迟加载,则不需要使用
LoadProperty
来加载第一个foreach
,因为Students
集合将在您第一次访问时加载。它应该会导致同样的 1300 个额外查询,例如显式加载。The comparison of the
SubjectId
is not the problem becausec.SubjectId
is a value of a primitive type (int
, I guess). The exception complains aboutEquals(c.Students)
.c.Students
is a constant (with respect to the queryduplicates
) but not a primitive type.I would also try to do the comparison in memory and not in the database. You are loading the whole data into memory anyway when you start your first
foreach
loop: It executes the queryallClasses
. Then inside of the loop you extend the IQueryableallClasses
to the IQueryableduplicates
which gets executed then in the innerforeach
loop. This is one database query per element of your outer loop! This could explain the poor performance of the code.So I would try to perform the content of the first
foreach
in memory. For the comparison of theStudents
list it is necessary to compare element by element, not the references to the Students collections because they are for sure different.Ordering the sequences by name is necessary because, I believe,
SequenceEqual
compares length of the sequence and then element 0 with element 0, then element 1 with element 1 and so on.Edit To your comment that the first query is still slow.
If you have 1300 classes with 30 students each the performance of eager loading (
Include
) could suffer from the multiplication of data which are transfered between database and client. This is explained here: How many Include I can use on ObjectSet in EntityFramework to retain performance? . The query is complex because it needs aJOIN
between classes and students and object materialization is complex as well because EF must filter out the duplicated data when the objects are created.An alternative approach is to load only the classes without the students in the first query and then load the students one by one inside of a loop explicitely. It would look like this:
You would have 1 + 1300 database queries in this example instead of only one, but you won't have the data multiplication which occurs with eager loading and the queries are simpler (no
JOIN
between classes and students).Explicite loading is explained here:
EntityObject
derived entities): http://msdn.microsoft.com/en-us/library/dd456855.aspxEntityObject
derived entities you can also use theLoad
method ofEntityCollection
: http://msdn.microsoft.com/en-us/library/bb896370.aspxIf you work with Lazy Loading the first
foreach
withLoadProperty
would not be necessary as theStudents
collections will be loaded the first time you access it. It should result in the same 1300 additional queries like explicite loading.