What guarantees are there on the run-time complexity (Big-O) of LINQ methods?
I've recently started using LINQ quite a bit, and I haven't really seen any mention of run-time complexity for any of the LINQ methods. Obviously, there are many factors at play here, so let's restrict the discussion to the plain IEnumerable LINQ-to-Objects provider. Further, let's assume that any Func passed in as a selector / mutator / etc. is a cheap O(1) operation.
It seems obvious that all the single-pass operations (Select, Where, Count, Take/Skip, Any/All, etc.) will be O(n), since they only need to walk the sequence once; although even this is subject to laziness.
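The laziness caveat can be made visible with a small experiment. This is only a sketch: `Source` and the `visited` counter are hypothetical names introduced here to count how many elements are actually pulled from the underlying sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int visited = 0; // how many source elements have actually been produced

// Hypothetical instrumented source: bumps the counter for every element it yields.
IEnumerable<int> Source()
{
    for (int i = 0; i < 1000; i++)
    {
        visited++;
        yield return i;
    }
}

// Building the query does no work: Where and Select are deferred.
var query = Source().Where(x => x % 2 == 0).Select(x => x * 10);
int afterBuild = visited;

// Take(3) stops pulling once it has three results, so only a short prefix is walked.
List<int> firstThree = query.Take(3).ToList();
int afterTake = visited;

Console.WriteLine($"after build: {afterBuild}, after Take(3): {afterTake}");
Console.WriteLine(string.Join(", ", firstThree));
```

So "O(n)" for these operators is an upper bound on a full enumeration; the work actually done depends on how much of the result the consumer asks for.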
Things are murkier for the more complex operations; the set-like operators (Union, Distinct, Except, etc.) work using GetHashCode by default (afaik), so it seems reasonable to assume they're using a hash-table internally, making these operations O(n) as well, in general. What about the versions that use an IEqualityComparer?
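On the comparer overloads: an IEqualityComparer<T> supplies both Equals and GetHashCode, so a hash-based implementation can keep working with a custom comparer. The snippet below is a small illustration using the built-in StringComparer.OrdinalIgnoreCase; the sample data is made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] words = { "apple", "Apple", "APPLE", "pear" };

// Default comparer: ordinal string equality, so all four strings are distinct.
int ordinalCount = words.Distinct().Count();

// Custom comparer: it provides a matching Equals and GetHashCode pair,
// so equal-ignoring-case strings land in the same hash bucket.
int ignoreCaseCount = words.Distinct(StringComparer.OrdinalIgnoreCase).Count();

Console.WriteLine($"{ordinalCount} vs {ignoreCaseCount}");
```

The flip side is that the expected cost depends on the comparer: one whose GetHashCode returns a constant would push every element into a single bucket and degrade lookups toward a linear scan.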
OrderBy would need a sort, so most likely we're looking at O(n log n). What if it's already sorted? How about if I say OrderBy().ThenBy() and provide the same key to both?
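As far as the resulting order goes (not the cost), the OrderBy().ThenBy() case with the same key can be checked directly. OrderBy is documented as a stable sort, so elements with equal keys keep their original relative order, and a ThenBy on the same key has no ties left to break. The tuple data here is invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var people = new[]
{
    (Name: "Ann", Age: 30),
    (Name: "Bob", Age: 25),
    (Name: "Cid", Age: 30),
};

// Stable sort: Ann stays ahead of Cid because both have Age 30.
List<string> once = people.OrderBy(p => p.Age).Select(p => p.Name).ToList();

// Same key used twice: the secondary key never distinguishes anything.
List<string> twice = people.OrderBy(p => p.Age).ThenBy(p => p.Age).Select(p => p.Name).ToList();

Console.WriteLine(string.Join(", ", once));
Console.WriteLine(string.Join(", ", twice));
```

This says nothing about how much extra comparison work the redundant ThenBy costs; it only shows that the output is unchanged.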
I could see GroupBy (and Join) using either sorting, or hashing. Which is it?
Contains would be O(n) on a List, but O(1) on a HashSet - does LINQ check the underlying container to see if it can speed things up?
And the real question - so far, I've been taking it on faith that the operations are performant. However, can I bank on that? STL containers, for example, clearly specify the complexity of every operation. Are there any similar guarantees on LINQ performance in the .NET library specification?
More questions (in response to comments):
Hadn't really thought about overhead, but I didn't expect there to be very much for simple Linq-to-Objects. The CodingHorror post is talking about Linq-to-SQL, where I can understand parsing the query and making SQL would add cost - is there a similar cost for the Objects provider too? If so, is it different if you're using the declarative or functional syntax?
There are very, very few guarantees, but there are a few optimizations:
- Extension methods that use indexed access, such as ElementAt, Skip, Last or LastOrDefault, will check whether the underlying type implements IList<T>, so that you get O(1) access instead of O(N).
- The Count method checks for an ICollection implementation, so that this operation is O(1) instead of O(N).
- Distinct, GroupBy, Join, and I believe also the set-aggregation methods (Union, Intersect and Except) use hashing, so they should be close to O(N) instead of O(N²).
- Contains checks for an ICollection implementation, so it may be O(1) if the underlying collection is also O(1), such as a HashSet<T>, but this depends on the actual data structure and is not guaranteed. Hash sets override the Contains method, which is why they are O(1).
- OrderBy methods use a stable quicksort, so they're O(N log N) average case.
I think that covers most if not all of the built-in extension methods. There really are very few performance guarantees; LINQ itself will try to take advantage of efficient data structures, but it isn't a free pass to write potentially inefficient code.
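The Contains delegation can be observed indirectly, without a decompiler. In this sketch (sample values are made up), the set is built with a case-insensitive comparer: when Enumerable.Contains hands the lookup to the collection's own Contains, the set's comparer applies; once the sequence is wrapped in a Select, the result is no longer an ICollection<T> and the default comparer is used instead.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var set = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "alpha", "beta" };

// Typed as IEnumerable<string> so that the LINQ extension method is the one being called.
IEnumerable<string> asSequence = set;

// Delegates to the set's own Contains, which uses the case-insensitive comparer.
bool viaCollection = asSequence.Contains("ALPHA");

// The Select wrapper hides the collection, so this is a linear scan with the default comparer.
bool viaIterator = asSequence.Select(s => s).Contains("ALPHA");

Console.WriteLine($"{viaCollection} / {viaIterator}");
```

A practical consequence: inserting any operator between the collection and Contains (or Count) gives up the shortcut, so the fast path only applies when the call is made directly on the collection.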
I've long known that .Count() returns .Count if the enumeration is an IList.
But I was always a bit wary about the run-time complexity of the set operations: .Intersect(), .Except(), .Union().
Here's the decompiled BCL (.NET 4.0/4.5) implementation for .Intersect() (comments mine):
Conclusions:
IEqualityComparer<T> also needs to match.)
For completeness, here are the implementations for .Union() and .Except().
Spoiler alert: they, too, have O(N+M) complexity.
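As a rough illustration of how a hash-based intersect reaches O(N+M), here is a minimal sketch. It is not the decompiled BCL code; `HashIntersect` is a hypothetical helper written for this example, and it assumes hash lookups are O(1) on average.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: one pass over `second` to build a set (O(M)),
// then one pass over `first` with O(1) average lookups (O(N)).
IEnumerable<T> HashIntersect<T>(IEnumerable<T> first, IEnumerable<T> second, IEqualityComparer<T> comparer)
{
    var remaining = new HashSet<T>(second, comparer);
    foreach (T item in first)
    {
        // Remove returns true only the first time an item is matched,
        // so each common element is yielded once.
        if (remaining.Remove(item))
        {
            yield return item;
        }
    }
}

int[] left = { 1, 2, 2, 3, 4 };
int[] right = { 2, 4, 6 };
List<int> common = HashIntersect(left, right, EqualityComparer<int>.Default).ToList();

Console.WriteLine(string.Join(", ", common));
```

Note that the sketch always builds a fresh set from the second sequence, even if that sequence is already a HashSet<T>; reusing it safely would require knowing that its comparer matches the one requested.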
All you can really bank on is that the Enumerable methods are well-written for the general case and won't use naive algorithms. There is probably third-party stuff (blogs, etc.) that describes the algorithms actually in use, but these are not official or guaranteed in the sense that STL algorithms are.
To illustrate, here is the reflected source code (courtesy of ILSpy) for Enumerable.Count from System.Core:
As you can see, it goes to some effort to avoid the naive solution of simply enumerating every element.
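The kind of shortcut being described can be sketched as follows. This is an approximation written for illustration, not the reflected System.Core source; `CountSketch` is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical re-creation of the idea: ask the collection for its size
// when it exposes one, and fall back to walking the sequence otherwise.
int CountSketch<T>(IEnumerable<T> source)
{
    if (source == null) throw new ArgumentNullException(nameof(source));

    // Fast paths: O(1) whenever the collection's own Count property is O(1).
    if (source is ICollection<T> generic) return generic.Count;
    if (source is System.Collections.ICollection nonGeneric) return nonGeneric.Count;

    // Slow path: O(n) enumeration, with overflow checking on the counter.
    int count = 0;
    using (IEnumerator<T> e = source.GetEnumerator())
    {
        while (e.MoveNext())
        {
            checked { count++; }
        }
    }
    return count;
}

int fromList = CountSketch(new List<int> { 1, 2, 3 });                 // fast path
int fromIterator = CountSketch(Enumerable.Range(0, 5).Where(x => x > 1)); // slow path

Console.WriteLine($"{fromList}, {fromIterator}");
```

Both calls return the same kind of answer; only the amount of work differs, which is why the complexity depends on the runtime type of the sequence rather than on the operator alone.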
I just broke out Reflector and they do check the underlying type when Contains is called.
The correct answer is "it depends". It depends on what type the underlying IEnumerable is. I know that for some collections (like collections that implement ICollection or IList) there are special code paths that are used; however, the actual implementation is not guaranteed to do anything special. For example, I know that ElementAt() has a special case for indexable collections, and similarly Count(). But in general you should probably assume the worst-case O(n) performance.
In general, I don't think you are going to find the kind of performance guarantees you want, though if you do run into a particular performance problem with a LINQ operator you can always just reimplement it for your particular collection. Also, there are many blogs and extensibility projects which extend LINQ to Objects to add these kinds of performance guarantees. Check out Indexed LINQ, which extends and adds to the operator set for more performance benefits.
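As one example of reimplementing an operator for a particular collection: if you know a list is kept sorted, a membership test can use binary search instead of the linear scan a general-purpose Contains would do. `SortedContains` below is a hypothetical helper, and the sortedness of the input is an assumption the caller must guarantee.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: O(log n) membership test.
// Assumes `sorted` is in ascending order according to IComparable<T>.
bool SortedContains<T>(IReadOnlyList<T> sorted, T value) where T : IComparable<T>
{
    int lo = 0;
    int hi = sorted.Count - 1;
    while (lo <= hi)
    {
        int mid = lo + (hi - lo) / 2; // avoids overflow of (lo + hi)
        int cmp = sorted[mid].CompareTo(value);
        if (cmp == 0) return true;
        if (cmp < 0) lo = mid + 1;
        else hi = mid - 1;
    }
    return false;
}

var sortedIds = new List<int> { 2, 3, 5, 8, 13, 21 };
bool has8 = SortedContains(sortedIds, 8);
bool has9 = SortedContains(sortedIds, 9);

Console.WriteLine($"{has8}, {has9}");
```

The trade-off is that the guarantee now lives in your own code: nothing checks that the list is really sorted, so the helper is only as reliable as that invariant.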