ParallelEnumerable.GroupBy: how lazy is it really?

MSDN says that ParallelEnumerable.GroupBy "groups in parallel the elements of a sequence according to a specified key selector function." So my question is: how lazy is it? It is clear that the ParallelQuery itself is lazy. But what about the IGrouping: is it lazy as well?

So if I do the following:
var entities = sites.AsParallel()
.Select(x => GetDataItemsFromWebsiteLazy(x))
.SelectMany(x => x)
.GroupBy(dataItem => dataItem.Url.Host)
.AsParallel()
.SelectMany(x => TransformToEntity(x));
will TransformToEntity be called for the first time only after all the sites have been fetched? Or will it be called as soon as the first GetDataItemsFromWebsiteLazy() call yields an element?
The point of all this is to make requests to different hosts in parallel. The data is processed as follows. For each website in the collection:
- request the website
- parse the response and extract another website url
- request the website at the extracted url
- parse the response and create an entity from the obtained data
The GroupBy operator, both in PLINQ and LINQ, is implemented by using deferred execution: the query represented by this method is not executed until it is enumerated. But executing the query does not have the behavior that you want. GroupBy yields groupings, and no grouping is yielded before the source sequence has been fully enumerated. An IGrouping<TKey,TElement> as a structure is internally just an immutable array with a Key property. It's a materialized collection, not a deferred enumerable. When GroupBy yields a grouping, that grouping contains all the elements of the source sequence that have the specific key; no other element is going to be added to it in the future.

Apparently what you want to achieve is what is known as "task parallelism": performing different operations on the same data, where these operations run in parallel to each other. In contrast, "data parallelism" means performing the same operation in parallel on different subsets of the data. The PLINQ library is designed to support data parallelism. The tool of choice for implementing task parallelism is the TPL Dataflow library. If you really want to do it with PLINQ, it's doable but tricky, and you might still miss useful features such as backpressure, which is natively supported in TPL Dataflow via the BoundedCapacity option.
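As a rough illustration of the TPL Dataflow approach suggested above (a sketch only: `FetchDataItems` and the `entity(...)` formatting are hypothetical stand-ins for the question's GetDataItemsFromWebsiteLazy and TransformToEntity, and the System.Threading.Tasks.Dataflow package must be referenced):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow; // NuGet package: System.Threading.Tasks.Dataflow

class PipelineSketch
{
    // Hypothetical stand-in for GetDataItemsFromWebsiteLazy.
    static IEnumerable<string> FetchDataItems(string site) =>
        Enumerable.Range(1, 2).Select(i => $"{site}/item{i}");

    static async Task Main()
    {
        var options = new ExecutionDataflowBlockOptions
        {
            MaxDegreeOfParallelism = 4, // requests to different hosts run in parallel
            BoundedCapacity = 10        // backpressure: producers wait when the buffer is full
        };

        var fetch = new TransformManyBlock<string, string>(site => FetchDataItems(site), options);
        var transform = new TransformBlock<string, string>(item => $"entity({item})", options);
        var print = new ActionBlock<string>(entity => Console.WriteLine(entity));

        var link = new DataflowLinkOptions { PropagateCompletion = true };
        fetch.LinkTo(transform, link);
        transform.LinkTo(print, link);

        foreach (var site in new[] { "a.example", "b.example" })
            await fetch.SendAsync(site);

        fetch.Complete();
        await print.Completion;
    }
}
```

Each stage processes items as soon as they arrive, so entities start flowing out before all sites have been fetched, which is exactly the streaming behavior that GroupBy cannot provide.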
The GroupBy extension is, in fact, not lazy at all in the streaming sense: the query itself is deferred until enumerated, but the moment you ask for the first result, the entire source sequence is consumed. This can easily be demonstrated with a small test program.
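The original listing is not reproduced here; a minimal sketch in its spirit (names are illustrative) that counts how many source elements get pulled:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    static int produced;

    // Source that counts how many elements have actually been pulled from it.
    static IEnumerable<int> GetNumbers() =>
        Enumerable.Range(1, 6).Select(i => { produced++; return i; });

    // GroupBy-based filter, mirroring the method name used in this answer.
    static IEnumerable<int> GetEvenNumbersUsingGroupBy() =>
        GetNumbers().GroupBy(i => i % 2)
                    .Where(g => g.Key == 0)
                    .SelectMany(g => g);

    static void Main()
    {
        var evens = GetEvenNumbersUsingGroupBy();
        Console.WriteLine($"Query built, elements pulled: {produced}");      // 0: still deferred

        var first = evens.First();                                           // pull ONE element...
        Console.WriteLine($"First even: {first}, elements pulled: {produced}"); // ...all 6 consumed
    }
}
```

Asking for a single element forced GroupBy to drain the whole source, because the lookup has to be complete before any group can be handed out.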
Meaning that even though we ask for only the first element of the result of the GetEvenNumbersUsingGroupBy method, the entire source sequence still gets enumerated. This is in contrast to a normal deferred enumerable built with the yield statement.
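The yield-based counterpart (again, names are illustrative) defers all work until iteration, and repeats it on every fresh iteration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Demo
{
    // Iterator method: the body runs only while the result is being iterated,
    // and runs again from the top on every fresh iteration.
    static IEnumerable<int> GetEvenNumbersUsingYield()
    {
        for (int i = 2; i <= 6; i += 2)
        {
            Console.WriteLine($"Yielding {i}");
            yield return i;
        }
    }

    static void Main()
    {
        var evens = GetEvenNumbersUsingYield();
        Console.WriteLine("Method called, nothing yielded yet.");
        Console.WriteLine($"Sum #1: {evens.Sum()}"); // "Yielding ..." lines appear here
        Console.WriteLine($"Sum #2: {evens.Sum()}"); // ...and appear again here
    }
}
```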
In other words, each time you iterate the results, they are re-evaluated, which is a typical characteristic of deferred evaluation (as opposed to straight-up lazy loading which caches the result after the first evaluation).
Note that this is the same whether you use AsParallel or not; it's a characteristic of the GroupBy extension (which by definition needs to build a hash table or other kind of lookup in order to store the individual groups) and is wholly independent of concurrency.

It's easy to see why this is the case if you think about how you would implement a deferred grouping function: in order to iterate all of the elements of a single group, you would have to iterate the entire sequence to be sure that you've actually covered all of the elements of that group. So while it might technically be possible to defer this one-time iteration of the entire sequence, it's probably not worth it in most cases, since it would have exactly the same memory and CPU characteristics as the eagerly-loaded version.
extension (which by definition needs to build a hash table or other kind of lookup in order to store the individual groups) and wholly independent of concurrency.It's easy to see why this is the case if you think about how you would implement a deferred grouping function; in order to iterate all of the elements of a single group, you would have to iterate the entire sequence to be sure that you've actually covered all of the elements of that group. So while it might technically be possible to defer this one-time iteration of the entire sequence, it's probably not worth it in most cases, since it's going to have the exact same memory and CPU characteristics as the eagerly-loaded version.