ParallelEnumerable.GroupBy: how lazy is it really?

MSDN says that ParallelEnumerable.GroupBy "groups in parallel the elements of a sequence according to a specified key selector function." So my question is: how lazy is it? It is clear that the ParallelQuery itself is lazy. But what about the IGrouping: is it lazy as well?

So if I do the following:
var entities = sites.AsParallel()
.Select(x => GetDataItemsFromWebsiteLazy(x))
.SelectMany(x => x)
.GroupBy(dataItem => dataItem.Url.Host)
.AsParallel()
.SelectMany(x => TransformToEntity(x));
will TransformToEntity be called for the first time only after all the sites have been fetched? Or will it be called as soon as the first GetDataItemsFromWebsiteLazy() call yields an element?
The point of all this is to make requests to different hosts in parallel. The data is processed as follows. For each website in the collection:
- request the website
- parse the response and extract another website url
- request the website at the extracted url
- parse the response and create an entity from the obtained data
The GroupBy operator, both in PLINQ and LINQ, is implemented by using deferred execution: the query represented by this method is not executed until it is enumerated. But executing the query does not have the behavior that you want. GroupBy yields groupings, and no grouping is yielded before the source sequence has been fully enumerated. An IGrouping<TKey,TElement> as a structure is internally just an immutable array with a Key property. It's a materialized collection, not a deferred enumerable. When GroupBy yields a grouping, that grouping contains all the elements of the source sequence that have the specific key; no other element is going to be added to it in the future.

Apparently what you want to achieve is what is known as "task parallelism": performing different operations on the same data, where these operations run in parallel to each other. In contrast, "data parallelism" means performing the same operation in parallel on different subsets of the data. The PLINQ library is designed to support data parallelism. The tool of choice for implementing task parallelism is the TPL Dataflow library. If you really want to do it with PLINQ, it's doable but tricky, and you might still miss useful features such as backpressure, which is natively supported in TPL Dataflow via the BoundedCapacity option.
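As a rough illustration of the TPL Dataflow approach suggested above (a sketch only: `FetchDataItems` and the `entity(...)` formatting are hypothetical stand-ins for the question's GetDataItemsFromWebsiteLazy and TransformToEntity, and the System.Threading.Tasks.Dataflow package must be referenced):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

class PipelineSketch
{
    // Hypothetical stand-in for GetDataItemsFromWebsiteLazy.
    static IEnumerable<string> FetchDataItems(string site) =>
        Enumerable.Range(1, 2).Select(i => $"{site}/item{i}");

    static async Task Main()
    {
        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4, // requests to different hosts run in parallel
            BoundedCapacity = 10        // backpressure: producers wait when the buffer is full
        };

        var fetch = new TransformManyBlock<string, string>(site => FetchDataItems(site), options);
        var transform = new TransformBlock<string, string>(item => $"entity({item})", options);
        var print = new ActionBlock<string>(entity => Console.WriteLine(entity));

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        fetch.LinkTo(transform, link);
        transform.LinkTo(print, link);

        foreach (var site in new[] { "a.example", "b.example" })
            await fetch.SendAsync(site);

        fetch.Complete();
        await print.Completion;
    }
}
```

Each stage processes items as soon as they arrive, so entities start flowing out before all sites have been fetched, which is exactly the streaming behavior that GroupBy cannot provide.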
The GroupBy extension is, in fact, not lazy at all in the streaming sense: the query itself is deferred until enumerated, but the moment you ask for the first result, the entire source sequence is consumed. This can easily be demonstrated with a small test program.
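The original listing is not reproduced here; a minimal sketch in its spirit (names are illustrative) that counts how many source elements get pulled:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static int produced;

    // Source that counts how many elements have actually been pulled from it.
    static IEnumerable<int> GetNumbers() =>
        Enumerable.Range(1, 6).Select(i => { produced++; return i; });

    // GroupBy-based filter, mirroring the method name used in this answer.
    static IEnumerable<int> GetEvenNumbersUsingGroupBy() =>
        GetNumbers().GroupBy(i => i % 2)
                    .Where(g => g.Key == 0)
                    .SelectMany(g => g);

    static void Main()
    {
        var evens = GetEvenNumbersUsingGroupBy();
        Console.WriteLine($"Query built, elements pulled: {produced}");      // 0: still deferred

        var first = evens.First();                                           // pull ONE element...
        Console.WriteLine($"First even: {first}, elements pulled: {produced}"); // ...all 6 consumed
    }
}
```

Asking for a single element forced GroupBy to drain the whole source, because the lookup has to be complete before any group can be handed out.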
Meaning that even though we ask for only the first element of the result of the GetEvenNumbersUsingGroupBy method, the entire source sequence still gets enumerated. This is in contrast to a normal deferred enumerable built with the yield statement.
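The yield-based counterpart (again, names are illustrative) defers all work until iteration, and repeats it on every fresh iteration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    // Iterator method: the body runs only while the result is being iterated,
    // and runs again from the top on every fresh iteration.
    static IEnumerable<int> GetEvenNumbersUsingYield()
    {
        for (int i = 2; i <= 6; i += 2)
        {
            Console.WriteLine($"Yielding {i}");
            yield return i;
        }
    }

    static void Main()
    {
        var evens = GetEvenNumbersUsingYield();
        Console.WriteLine("Method called, nothing yielded yet.");
        Console.WriteLine($"Sum #1: {evens.Sum()}"); // "Yielding ..." lines appear here
        Console.WriteLine($"Sum #2: {evens.Sum()}"); // ...and appear again here
    }
}
```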
In other words, each time you iterate the results, they are re-evaluated, which is a typical characteristic of deferred evaluation (as opposed to straight-up lazy loading which caches the result after the first evaluation).
Note that this is the same whether you use AsParallel or not; it's a characteristic of the GroupBy extension (which by definition needs to build a hash table or other kind of lookup in order to store the individual groups) and is wholly independent of concurrency.

It's easy to see why this is the case if you think about how you would implement a deferred grouping function: in order to iterate all of the elements of a single group, you would have to iterate the entire sequence to be sure that you've actually covered all of the elements of that group. So while it might technically be possible to defer this one-time iteration of the entire sequence, it's probably not worth it in most cases, since it would have exactly the same memory and CPU characteristics as the eagerly-loaded version.
extension (which by definition needs to build a hash table or other kind of lookup in order to store the individual groups) and wholly independent of concurrency.It's easy to see why this is the case if you think about how you would implement a deferred grouping function; in order to iterate all of the elements of a single group, you would have to iterate the entire sequence to be sure that you've actually covered all of the elements of that group. So while it might technically be possible to defer this one-time iteration of the entire sequence, it's probably not worth it in most cases, since it's going to have the exact same memory and CPU characteristics as the eagerly-loaded version.