为什么.Contains 慢?通过主键获取多个实体的最有效方法?
通过主键选择多个实体的最有效方法是什么?
public IEnumerable<Models.Image> GetImagesById(IEnumerable<int> ids)
{
//return ids.Select(id => Images.Find(id)); //is this cool?
return Images.Where( im => ids.Contains(im.Id)); //is this better, worse or the same?
//is there a (better) third way?
}
我意识到我可以做一些性能测试来进行比较,但我想知道实际上是否有比这两种方法更好的方法,并且正在寻找一旦它们被“翻译”,就可以了解这两个查询之间的区别(如果有的话)。
What's the most efficient way to select multiple entities by primary key?
public IEnumerable<Models.Image> GetImagesById(IEnumerable<int> ids)
{
//return ids.Select(id => Images.Find(id)); //is this cool?
return Images.Where( im => ids.Contains(im.Id)); //is this better, worse or the same?
//is there a (better) third way?
}
I realise that I could do some performance tests to compare, but I am wondering if there is in fact a better way than both, and am looking for some enlightenment on what the difference between these two queries is, if any, once they have been 'translated'.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
更新:通过在 EF6 中添加 InExpression,处理 Enumerable.Contains 的性能显着提高。这个答案中的分析很棒,但自 2013 年以来基本上已过时。
在实体框架中使用
Contains
实际上非常慢。确实,它会转换为 SQL 中的IN
子句,并且 SQL 查询本身执行速度很快。但问题和性能瓶颈在于从 LINQ 查询到 SQL 的转换。将创建的表达式树将扩展为一长串OR
连接,因为没有表示IN
的本机表达式。创建 SQL 时,许多OR
的表达式会被识别并折叠回 SQLIN
子句中。这并不意味着使用
Contains
比为ids
集合中的每个元素发出一个查询(您的第一个选项)更糟糕。它可能仍然更好 - 至少对于不太大的集合来说。但对于大型收藏来说,这确实很糟糕。我记得不久前我测试过一个包含大约 12.000 个元素的Contains
查询,该查询有效,但需要大约一分钟的时间,尽管 SQL 中的查询执行时间不到一秒。可能值得测试多个往返数据库的组合的性能,并且每个往返的
Contains
表达式中的元素数量较少。此处显示并解释了这种方法以及将
Contains
与实体框架一起使用的限制:为什么 Contains() 运算符会降低实体框架的性能戏剧性地?
在这种情况下,原始 SQL 命令可能会表现最佳,这意味着您调用 dbContext.Database.SqlQuery(sqlString) 或 dbContext.Images .SqlQuery(sqlString) 其中
sqlString
是 @Rune 答案中显示的 SQL。编辑
以下是一些测量结果:
我在一个包含 550000 条记录和 11 列的表上完成了此操作(ID 从 1 开始,没有间隙),并随机选取了 20000 个 id:
测试 1
结果-> 毫秒 = 85.5 秒
测试 2
结果 -> 毫秒 = 84.5 秒
AsNoTracking
的这种微小效果非常不寻常。它表明瓶颈不是对象物化(也不是 SQL,如下所示)。对于这两个测试,可以在 SQL Profiler 中看到 SQL 查询很晚才到达数据库。 (我没有精确测量,但晚于 70 秒。)显然,将此 LINQ 查询转换为 SQL 的成本非常昂贵。
测试3
结果-> 毫秒 = 5.1 秒
测试 4
结果 -> 毫秒 = 3.8 秒
这次禁用跟踪的效果更加明显。
测试 5
结果 -> msec = 3.7 sec
我的理解是
context.Database.SqlQuery(sql)
与context.Set().SqlQuery 相同(sql).AsNoTracking()
,因此测试 4 和测试 5 之间没有预期的差异。(由于可能的原因,结果集的长度并不总是相同随机 ID 选择后会重复,但总是在 19600 到 19640 个元素之间。)
编辑 2
测试 6
即使到数据库的 20000 次往返也比使用
Contains< 更快/代码>:
结果-> msec = 73.6 sec
请注意,我使用了
SingleOrDefault
而不是Find
。使用与Find
相同的代码非常慢(几分钟后我取消了测试),因为Find
在内部调用DetectChanges
。禁用自动更改检测 (context.Configuration.AutoDetectChangesEnabled = false
) 会导致与SingleOrDefault
大致相同的性能。使用AsNoTracking
可以将时间缩短一到两秒。测试是在同一台计算机上使用数据库客户端(控制台应用程序)和数据库服务器完成的。由于多次往返,“远程”数据库的最后结果可能会变得更糟。
UPDATE: With the addition of InExpression in EF6, the performance of processing Enumerable.Contains improved dramatically. The analysis in this answer is great but largely obsolete since 2013.
Using
Contains
in Entity Framework is actually very slow. It's true that it translates into anIN
clause in SQL and that the SQL query itself is executed fast. But the problem and the performance bottleneck is in the translation from your LINQ query into SQL. The expression tree which will be created is expanded into a long chain ofOR
concatenations because there is no native expression which represents anIN
. When the SQL is created this expression of manyOR
s is recognized and collapsed back into the SQLIN
clause.This does not mean that using
Contains
is worse than issuing one query per element in yourids
collection (your first option). It's probably still better - at least for not too large collections. But for large collections it is really bad. I remember that I had tested some time ago aContains
query with about 12.000 elements which worked but took around a minute even though the query in SQL executed in less than a second.It might be worth to test the performance of a combination of multiple roundtrips to the database with a smaller number of elements in a
Contains
expression for each roundtrip.This approach and also the limitations of using
Contains
with Entity Framework is shown and explained here:Why does the Contains() operator degrade Entity Framework's performance so dramatically?
It's possible that a raw SQL command will perform best in this situation which would mean that you call
dbContext.Database.SqlQuery<Image>(sqlString)
ordbContext.Images.SqlQuery(sqlString)
wheresqlString
is the SQL shown in @Rune's answer.Edit
Here are some measurements:
I have done this on a table with 550000 records and 11 columns (IDs start from 1 without gaps) and picked randomly 20000 ids:
Test 1
Result -> msec = 85.5 sec
Test 2
Result -> msec = 84.5 sec
This tiny effect of
AsNoTracking
is very unusual. It indicates that the bottleneck is not object materialization (and not SQL as shown below).For both tests it can be seen in SQL Profiler that the SQL query arrives at the database very late. (I didn't measure exactly but it was later than 70 seconds.) Obviously the translation of this LINQ query into SQL is very expensive.
Test 3
Result -> msec = 5.1 sec
Test 4
Result -> msec = 3.8 sec
This time the effect of disabling tracking is more noticable.
Test 5
Result -> msec = 3.7 sec
My understanding is that
context.Database.SqlQuery<MyEntity>(sql)
is the same ascontext.Set<MyEntity>().SqlQuery(sql).AsNoTracking()
, so there is no difference expected between Test 4 and Test 5.(The length of the result sets was not always the same due to possible duplicates after the random id selection but it was always between 19600 and 19640 elements.)
Edit 2
Test 6
Even 20000 roundtrips to the database are faster than using
Contains
:Result -> msec = 73.6 sec
Note that I have used
SingleOrDefault
instead ofFind
. Using the same code withFind
is very slow (I cancelled the test after several minutes) becauseFind
callsDetectChanges
internally. Disabling auto change detection (context.Configuration.AutoDetectChangesEnabled = false
) leads to roughly the same performance asSingleOrDefault
. UsingAsNoTracking
reduces the time by one or two seconds.Tests were done with database client (console app) and database server on the same machine. The last result might get significantly worse with a "remote" database due to the many roundtrips.
第二种选择肯定比第一种更好。第一个选项将导致对数据库进行
ids.Length
查询,而第二个选项可以在 SQL 查询中使用'IN'
运算符。它基本上会将您的 LINQ 查询转换为类似于以下 SQL 的内容:其中 value1、value2 等是 ids 变量的值。但请注意,我认为可以通过这种方式序列化到查询中的值的数量可能存在上限。我看看能不能找到一些文档...
The second option is definitely better than the first. The first option will result in
ids.Length
queries to the database, while the second option can use an'IN'
operator in the SQL query. It will basically turn your LINQ query into something like the following SQL:where value1, value2 etc. are the values of your ids variable. Be aware, however, that I think there may be an upper limit on the number of values that can be serialized into a query in this way. I'll see if I can find some documentation...
Weel,最近我遇到了类似的问题,我发现最好的方法是将列表插入临时表中,然后进行连接。
这不是一个漂亮的方法,但对于大型列表来说,它的性能非常好。
Weel, recently I have a similar problem and the best way I found was insert the list in a temp Table and then make a join.
It's not a pretty way, but for large lists, it is very performant.
使用 toArray() 将列表转换为数组可以提高性能。你可以这样做:
Transforming the List to an Array with toArray() increases performance. You can do it this way: