为什么.Contains 慢?通过主键获取多个实体的最有效方法?

发布于 2024-12-15 04:52:11 字数 452 浏览 2 评论 0原文

通过主键选择多个实体的最有效方法是什么?

public IEnumerable<Models.Image> GetImagesById(IEnumerable<int> ids)
{

    //return ids.Select(id => Images.Find(id));       //is this cool?
    return Images.Where( im => ids.Contains(im.Id));  //is this better, worse or the same?
    //is there a (better) third way?

}

我意识到我可以做一些性能测试来进行比较,但我想知道实际上是否有比这两种方法更好的方法,并且正在寻找一旦它们被“翻译”,就可以了解这两个查询之间的区别(如果有的话)。

What's the most efficient way to select multiple entities by primary key?

public IEnumerable<Models.Image> GetImagesById(IEnumerable<int> ids)
{

    //return ids.Select(id => Images.Find(id));       //is this cool?
    return Images.Where( im => ids.Contains(im.Id));  //is this better, worse or the same?
    //is there a (better) third way?

}

I realise that I could do some performance tests to compare, but I am wondering if there is in fact a better way than both, and am looking for some enlightenment on what the difference between these two queries is, if any, once they have been 'translated'.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

尤怨 2024-12-22 04:52:11

更新:通过在 EF6 中添加 InExpression,处理 Enumerable.Contains 的性能显着提高。这个答案中的分析很棒,但自 2013 年以来基本上已过时。

在实体框架中使用 Contains 实际上非常慢。确实,它会转换为 SQL 中的 IN 子句,并且 SQL 查询本身执行速度很快。但问题和性能瓶颈在于从 LINQ 查询到 SQL 的转换。将创建的表达式树将扩展为一长串 OR 连接,因为没有表示 IN 的本机表达式。创建 SQL 时,许多 OR 的表达式会被识别并折叠回 SQL IN 子句中。

这并不意味着使用 Contains 比为 ids 集合中的每个元素发出一个查询(您的第一个选项)更糟糕。它可能仍然更好 - 至少对于不太大的集合来说。但对于大型收藏来说,这确实很糟糕。我记得不久前我测试过一个包含大约 12.000 个元素的 Contains 查询,该查询有效,但需要大约一分钟的时间,尽管 SQL 中的查询执行时间不到一秒。

可能值得测试多个往返数据库的组合的性能,并且每个往返的 Contains 表达式中的元素数量较少。

此处显示并解释了这种方法以及将 Contains 与实体框架一起使用的限制:

为什么 Contains() 运算符会降低实体框架的性能戏剧性地?

在这种情况下,原始 SQL 命令可能会表现最佳,这意味着您调用 dbContext.Database.SqlQuery(sqlString) 或 dbContext.Images .SqlQuery(sqlString) 其中 sqlString 是 @Rune 答案中显示的 SQL。

编辑

以下是一些测量结果:

我在一个包含 550000 条记录和 11 列的表上完成了此操作(ID 从 1 开始,没有间隙),并随机选取了 20000 个 id:

using (var context = new MyDbContext())
{
    Random rand = new Random();
    var ids = new List<int>();
    for (int i = 0; i < 20000; i++)
        ids.Add(rand.Next(550000));

    Stopwatch watch = new Stopwatch();
    watch.Start();

    // here are the code snippets from below

    watch.Stop();
    var msec = watch.ElapsedMilliseconds;
}

测试 1

var result = context.Set<MyEntity>()
    .Where(e => ids.Contains(e.ID))
    .ToList();

结果-> 毫秒 = 85.5 秒

测试 2

var result = context.Set<MyEntity>().AsNoTracking()
    .Where(e => ids.Contains(e.ID))
    .ToList();

结果 -> 毫秒 = 84.5 秒

AsNoTracking 的这种微小效果非常不寻常。它表明瓶颈不是对象物化(也不是 SQL,如下所示)。

对于这两个测试,可以在 SQL Profiler 中看到 SQL 查询很晚才到达数据库。 (我没有精确测量,但晚于 70 秒。)显然,将此 LINQ 查询转换为 SQL 的成本非常昂贵。

测试3

var values = new StringBuilder();
values.AppendFormat("{0}", ids[0]);
for (int i = 1; i < ids.Count; i++)
    values.AppendFormat(", {0}", ids[i]);

var sql = string.Format(
    "SELECT * FROM [MyDb].[dbo].[MyEntities] WHERE [ID] IN ({0})",
    values);

var result = context.Set<MyEntity>().SqlQuery(sql).ToList();

结果-> 毫秒 = 5.1 秒

测试 4

// same as Test 3 but this time including AsNoTracking
var result = context.Set<MyEntity>().SqlQuery(sql).AsNoTracking().ToList();

结果 -> 毫秒 = 3.8 秒

这次禁用跟踪的效果更加明显。

测试 5

// same as Test 3 but this time using Database.SqlQuery
var result = context.Database.SqlQuery<MyEntity>(sql).ToList();

结果 -> msec = 3.7 sec

我的理解是 context.Database.SqlQuery(sql)context.Set().SqlQuery 相同(sql).AsNoTracking(),因此测试 4 和测试 5 之间没有预期的差异。

(由于可能的原因,结果集的长度并不总是相同随机 ID 选择后会重复,但总是在 19600 到 19640 个元素之间。)

编辑 2

测试 6

即使到数据库的 20000 次往返也比使用 Contains< 更快/代码>:

var result = new List<MyEntity>();
foreach (var id in ids)
    result.Add(context.Set<MyEntity>().SingleOrDefault(e => e.ID == id));

结果-> msec = 73.6 sec

请注意,我使用了 SingleOrDefault 而不是 Find。使用与 Find 相同的代码非常慢(几分钟后我取消了测试),因为 Find 在内部调用 DetectChanges。禁用自动更改检测 (context.Configuration.AutoDetectChangesEnabled = false) 会导致与 SingleOrDefault 大致相同的性能。使用 AsNoTracking 可以将时间缩短一到两秒。

测试是在同一台计算机上使用数据库客户端(控制台应用程序)和数据库服务器完成的。由于多次往返,“远程”数据库的最后结果可能会变得更糟。

UPDATE: With the addition of InExpression in EF6, the performance of processing Enumerable.Contains improved dramatically. The analysis in this answer is great but largely obsolete since 2013.

Using Contains in Entity Framework is actually very slow. It's true that it translates into an IN clause in SQL and that the SQL query itself is executed fast. But the problem and the performance bottleneck is in the translation from your LINQ query into SQL. The expression tree which will be created is expanded into a long chain of OR concatenations because there is no native expression which represents an IN. When the SQL is created this expression of many ORs is recognized and collapsed back into the SQL IN clause.

This does not mean that using Contains is worse than issuing one query per element in your ids collection (your first option). It's probably still better - at least for not too large collections. But for large collections it is really bad. I remember that I had tested some time ago a Contains query with about 12.000 elements which worked but took around a minute even though the query in SQL executed in less than a second.

It might be worth to test the performance of a combination of multiple roundtrips to the database with a smaller number of elements in a Contains expression for each roundtrip.

This approach and also the limitations of using Contains with Entity Framework is shown and explained here:

Why does the Contains() operator degrade Entity Framework's performance so dramatically?

It's possible that a raw SQL command will perform best in this situation which would mean that you call dbContext.Database.SqlQuery<Image>(sqlString) or dbContext.Images.SqlQuery(sqlString) where sqlString is the SQL shown in @Rune's answer.

Edit

Here are some measurements:

I have done this on a table with 550000 records and 11 columns (IDs start from 1 without gaps) and picked randomly 20000 ids:

using (var context = new MyDbContext())
{
    Random rand = new Random();
    var ids = new List<int>();
    for (int i = 0; i < 20000; i++)
        ids.Add(rand.Next(550000));

    Stopwatch watch = new Stopwatch();
    watch.Start();

    // here are the code snippets from below

    watch.Stop();
    var msec = watch.ElapsedMilliseconds;
}

Test 1

var result = context.Set<MyEntity>()
    .Where(e => ids.Contains(e.ID))
    .ToList();

Result -> msec = 85.5 sec

Test 2

var result = context.Set<MyEntity>().AsNoTracking()
    .Where(e => ids.Contains(e.ID))
    .ToList();

Result -> msec = 84.5 sec

This tiny effect of AsNoTracking is very unusual. It indicates that the bottleneck is not object materialization (and not SQL as shown below).

For both tests it can be seen in SQL Profiler that the SQL query arrives at the database very late. (I didn't measure exactly but it was later than 70 seconds.) Obviously the translation of this LINQ query into SQL is very expensive.

Test 3

var values = new StringBuilder();
values.AppendFormat("{0}", ids[0]);
for (int i = 1; i < ids.Count; i++)
    values.AppendFormat(", {0}", ids[i]);

var sql = string.Format(
    "SELECT * FROM [MyDb].[dbo].[MyEntities] WHERE [ID] IN ({0})",
    values);

var result = context.Set<MyEntity>().SqlQuery(sql).ToList();

Result -> msec = 5.1 sec

Test 4

// same as Test 3 but this time including AsNoTracking
var result = context.Set<MyEntity>().SqlQuery(sql).AsNoTracking().ToList();

Result -> msec = 3.8 sec

This time the effect of disabling tracking is more noticable.

Test 5

// same as Test 3 but this time using Database.SqlQuery
var result = context.Database.SqlQuery<MyEntity>(sql).ToList();

Result -> msec = 3.7 sec

My understanding is that context.Database.SqlQuery<MyEntity>(sql) is the same as context.Set<MyEntity>().SqlQuery(sql).AsNoTracking(), so there is no difference expected between Test 4 and Test 5.

(The length of the result sets was not always the same due to possible duplicates after the random id selection but it was always between 19600 and 19640 elements.)

Edit 2

Test 6

Even 20000 roundtrips to the database are faster than using Contains:

var result = new List<MyEntity>();
foreach (var id in ids)
    result.Add(context.Set<MyEntity>().SingleOrDefault(e => e.ID == id));

Result -> msec = 73.6 sec

Note that I have used SingleOrDefault instead of Find. Using the same code with Find is very slow (I cancelled the test after several minutes) because Find calls DetectChanges internally. Disabling auto change detection (context.Configuration.AutoDetectChangesEnabled = false) leads to roughly the same performance as SingleOrDefault. Using AsNoTracking reduces the time by one or two seconds.

Tests were done with database client (console app) and database server on the same machine. The last result might get significantly worse with a "remote" database due to the many roundtrips.

拥醉 2024-12-22 04:52:11

第二种选择肯定比第一种更好。第一个选项将导致对数据库进行 ids.Length 查询,而第二个选项可以在 SQL 查询中使用 'IN' 运算符。它基本上会将您的 LINQ 查询转换为类似于以下 SQL 的内容:

SELECT *
FROM ImagesTable
WHERE id IN (value1,value2,...)

其中 value1、value2 等是 ids 变量的值。但请注意,我认为可以通过这种方式序列化到查询中的值的数量可能存在上限。我看看能不能找到一些文档...

The second option is definitely better than the first. The first option will result in ids.Length queries to the database, while the second option can use an 'IN' operator in the SQL query. It will basically turn your LINQ query into something like the following SQL:

SELECT *
FROM ImagesTable
WHERE id IN (value1,value2,...)

where value1, value2 etc. are the values of your ids variable. Be aware, however, that I think there may be an upper limit on the number of values that can be serialized into a query in this way. I'll see if I can find some documentation...

离不开的别离 2024-12-22 04:52:11

Weel,最近我遇到了类似的问题,我发现最好的方法是将列表插入临时表中,然后进行连接。

private List<Foo> GetFoos(IEnumerable<long> ids)
{
    var sb = new StringBuilder();
    sb.Append("DECLARE @Temp TABLE (Id bigint PRIMARY KEY)\n");

    foreach (var id in ids)
    {
        sb.Append("INSERT INTO @Temp VALUES ('");
        sb.Append(id);
        sb.Append("')\n");
    }

    sb.Append("SELECT f.* FROM [dbo].[Foo] f inner join @Temp t on f.Id = t.Id");

    return this.context.Database.SqlQuery<Foo>(sb.ToString()).ToList();
}

这不是一个漂亮的方法,但对于大型列表来说,它的性能非常好。

Weel, recently I have a similar problem and the best way I found was insert the list in a temp Table and then make a join.

private List<Foo> GetFoos(IEnumerable<long> ids)
{
    var sb = new StringBuilder();
    sb.Append("DECLARE @Temp TABLE (Id bigint PRIMARY KEY)\n");

    foreach (var id in ids)
    {
        sb.Append("INSERT INTO @Temp VALUES ('");
        sb.Append(id);
        sb.Append("')\n");
    }

    sb.Append("SELECT f.* FROM [dbo].[Foo] f inner join @Temp t on f.Id = t.Id");

    return this.context.Database.SqlQuery<Foo>(sb.ToString()).ToList();
}

It's not a pretty way, but for large lists, it is very performant.

筱武穆 2024-12-22 04:52:11

使用 toArray() 将列表转换为数组可以提高性能。你可以这样做:

ids.Select(id => Images.Find(id));     
    return Images.toArray().Where( im => ids.Contains(im.Id));  

Transforming the List to an Array with toArray() increases performance. You can do it this way:

ids.Select(id => Images.Find(id));     
    return Images.toArray().Where( im => ids.Contains(im.Id));  
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文