组内的 LINQ 到对象索引 + 对于不同的分组(又名 ROW_NUMBER 与 PARTITION BY 等效)

发布于 2024-07-29 06:42:52 字数 1276 浏览 3 评论 0 原文

经过大量 Google 搜索和代码实验后,我被一个复杂的 C# LINQ-to-objects 问题难住了,在 SQL 中,使用一对 ROW_NUMBER()...PARTITION BY 函数和一个或两个子查询很容易解决这个问题。

简而言之,这就是我在代码中尝试做的事情——基本要求是从列表中删除重复的文档:

  1. 首先,按 (Document.Title, Document.SourceId) 对列表进行分组,假设有一个(简化的)类定义像这样:
    类文档 
      { 
          字符串标题; 
          int 源 ID;   // 源具有优先级(ID=1 优于 ID=2) 
      }
  2. 在该组中,为每个文档分配一个索引(例如,索引 0 == 具有此来源的此标题的第一个文档,索引 1 = 具有此来源的此标题的第二个文档,等等)。 我喜欢 SQL 中 ROW_NUMBER() 的等效项!

  3. 现在按 (Document.Title, Index) 分组,其中 Index 是在步骤 #2 中计算的。 对于每一组,仅返回一个文档:Document.SourceId 最低的文档。

步骤#1很简单(例如codepronet.blogspot.com/2009/01/group-by-in-linq.html),但我在步骤#2和#3上遇到了困难。 我似乎无法构建一个无红色曲线的 C# LINQ 查询来解决所有三个步骤。

Anders Heilsberg 在 此主题上的帖子 是我认为如果语法正确的话,上面的步骤 #2 和 #3 的答案。

我宁愿避免使用外部局部变量来进行索引计算,正如 slodge.blogspot.com/2009/01/adding-row-number-using-linq-to-objects.html 上建议的那样,因为该解决方案会中断如果外部变量被修改。

最佳情况下,可以首先完成按标题分组步骤,因此“内部”分组(首先按源计算索引,然后按索引过滤掉重复项)可以对每个“按标题”中的少量对象进行操作组,因为每个按标题组中的文档数量通常低于 100。我真的不想要 N2 解决方案!

我当然可以使用嵌套的 foreach 循环来解决这个问题,但这似乎是使用 LINQ 应该很简单的问题。

有任何想法吗?

After much Google searching and code experimentation, I'm stumped on a complex C# LINQ-to-objects problem which in SQL would be easy to solve with a pair of ROW_NUMBER()...PARTITION BY functions and a subquery or two.

Here's, in words, what I'm trying to do in code-- the underlying requirement is removing duplicate documents from a list:

  1. First, group a list by (Document.Title, Document.SourceId), assuming a (simplified) class definition like this:
    class Document
    {
        string Title;
        int SourceId; // sources are prioritized (ID=1 better than ID=2)
    }
  2. Within that group, assign each document an index (e.g. Index 0 == 1st document with this title from this source, Index 1 = 2nd document with this title from this source, etc.). I'd love the equivalent of ROW_NUMBER() in SQL!

  3. Now group by (Document.Title, Index), where Index was computed in Step #2. For each group, return only one document: the one with the lowest Document.SourceId.

Step #1 is easy (e.g. codepronet.blogspot.com/2009/01/group-by-in-linq.html), but I'm getting stumped on steps #2 and #3. I can't seem to build a red-squiggle-free C# LINQ query to solve all three steps.

Anders Heilsberg's post on this thread is I think the answer to Steps #2 and #3 above if I could get the syntax right.

I'd prefer to avoid using an external local variable to do the Index computation, as recommended on slodge.blogspot.com/2009/01/adding-row-number-using-linq-to-objects.html, since that solution breaks if the external variable is modified.

Optimally, the group-by-Title step could be done first, so the "inner" groupings (first by Source to compute the index, then by Index to filter out duplicates) can operate on small numbers of objects in each "by title" group, since the # of documents in each by-title group is usually under 100. I really don't want an N2 solution!

I could certainly solve this with nested foreach loops, but it seems like the kind of problem which should be simple with LINQ.

Any ideas?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

心如狂蝶 2024-08-05 06:42:52

我认为 jpbochi 错过了您希望按值对进行分组(标题+SourceId,然后标题+索引)。 这是一个 LINQ 查询(主要)解决方案:

var selectedFew = 
    from doc in docs
    group doc by new { doc.Title, doc.SourceId } into g
    from docIndex in g.Select((d, i) => new { Doc = d, Index = i })
    group docIndex by new { docIndex.Doc.Title, docIndex.Index } into g
    select g.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b);

首先,我们按 Title+SourceId 进行分组(我使用匿名类型,因为编译器为分组查找构建了一个良好的哈希码)。 然后我们使用 Select 将分组索引附加到文档,我们在第二个分组中使用它。 最后,对于每个组,我们选择最低的 SourceId。

鉴于此输入:

var docs = new[] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
    new { Title = "123", SourceId = 5 },
};

我得到此输出:

{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 1 }
{ Doc = { Title = 123, SourceId = 7 }, Index = 2 }

更新:我刚刚看到您关于首先按标题分组的问题。 您可以在标题组上使用子查询来执行此操作:

var selectedFew =
    from doc in docs
    group doc by doc.Title into titleGroup
    from docWithIndex in
        (
            from doc in titleGroup
            group doc by doc.SourceId into idGroup
            from docIndex in idGroup.Select((d, i) => new { Doc = d, Index = i })
            group docIndex by docIndex.Index into indexGroup
            select indexGroup.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b)
        )
    select docWithIndex;

I think jpbochi missed that you want your groupings to be by pairs of values (Title+SourceId then Title+Index). Here's a LINQ query (mostly) solution:

var selectedFew = 
    from doc in docs
    group doc by new { doc.Title, doc.SourceId } into g
    from docIndex in g.Select((d, i) => new { Doc = d, Index = i })
    group docIndex by new { docIndex.Doc.Title, docIndex.Index } into g
    select g.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b);

First we group by Title+SourceId (I use an anonymous type because the compiler builds a good hashcode for the grouping lookup). Then we use Select to attach the grouped index to the document, which we use in our second grouping. Finally, for each group we pick the lowest SourceId.

Given this input:

var docs = new[] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
    new { Title = "123", SourceId = 5 },
};

I get this output:

{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 0 }
{ Doc = { Title = 123, SourceId = 5 }, Index = 1 }
{ Doc = { Title = 123, SourceId = 7 }, Index = 2 }

Update: I just saw your question about grouping by Title first. You can do this using a subquery on your Title groups:

var selectedFew =
    from doc in docs
    group doc by doc.Title into titleGroup
    from docWithIndex in
        (
            from doc in titleGroup
            group doc by doc.SourceId into idGroup
            from docIndex in idGroup.Select((d, i) => new { Doc = d, Index = i })
            group docIndex by docIndex.Index into indexGroup
            select indexGroup.Aggregate((a,b) => (a.Doc.SourceId <= b.Doc.SourceId) ? a : b)
        )
    select docWithIndex;
任谁 2024-08-05 06:42:52

说实话,我对你的问题很困惑。 也许你应该解释一下你想要解决的问题。 无论如何,我会尝试回答我所理解的。

1) 首先,我假设您已经有一个按 Title+SourceId 分组的文档列表。 出于测试目的,我硬编码了一个列表,如下所示:

var docs = new [] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
};

2) 要在每个项目中放置索引,您可以使用 Select 扩展方法,传递 Func 选择器函数。 像这样:

var docsWithIndex
    = docs
    .Select( (d, i) => new { Doc = d, Index = i } );

3)根据我的理解,下一步是按 Title 对最后一个结果进行分组。 操作方法如下:

var docsGroupedByTitle
    = docsWithIndex
    .GroupBy( a => a.Doc.Title );

GroupBy 函数(上面使用的)返回 IEnumerable>。 由于群也是可枚举的,所以我们现在有可枚举的可枚举的。

4) 现在,对于上面的每个组,我们将仅获取具有最小 SourceId 的项目。 为了进行此操作,我们需要 2 级递归。 在 LINQ 中,外层是一个选择(对于每个组,获取其其中一个项目),内层是一个聚合(获取具有最低 SourceId 的项目):

var selectedFew
    = docsGroupedByTitle
    .Select(
        g => g.Aggregate(
            (a, b) => (a.Doc.SourceId  <= b.Doc.SourceId) ? a : b
        )
    );

只是为了确保它有效,我用一个简单的 foreach 测试了它:

foreach (var a in selectedFew) Console.WriteLine(a);
//The result will be:
//{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
//{ Doc = { Title = 123, SourceId = 5 }, Index = 4 }

我不确定这就是你想要的。 如果没有,请评论答案,我可以修复答案。 我希望这有帮助。

观察:我的测试中使用的所有类都是匿名。 因此,您实际上不需要定义 DocumentWithIndex 类型。 实际上,我什至还没有声明一个 Document 类。

To be honest, I'm quite confused with your question. Maybe if you should explain what you're trying to solve. Anyway, I'll try to answer what I understood.

1) First, I'll assume that you already have a list of documents grouped by Title+SourceId. For testing purposes, I hardcoded a list as follow:

var docs = new [] {
    new { Title = "ABC", SourceId = 0 },
    new { Title = "ABC", SourceId = 4 },
    new { Title = "ABC", SourceId = 2 },
    new { Title = "123", SourceId = 7 },
    new { Title = "123", SourceId = 5 },
};

2) To get put a index in every item, you can use the Select extension method, passing a Func selector function. Like this:

var docsWithIndex
    = docs
    .Select( (d, i) => new { Doc = d, Index = i } );

3) From what I understood, the next step would be to group the last result by Title. Here's how to do it:

var docsGroupedByTitle
    = docsWithIndex
    .GroupBy( a => a.Doc.Title );

The GroupBy function (used above) returns an IEnumerable<IGrouping<string,DocumentWithIndex>>. Since a group is enumerable too, we now have an enumerable of enumerables.

4) Now, for each of the groups above, we'll get only the item with the minimum SourceId. To make this operation we'll need 2 levels of recursion. In LINQ, the outer level is a selection (for each group, get one of its items), and the inner level is an aggregation (get the item with the lowest SourceId):

var selectedFew
    = docsGroupedByTitle
    .Select(
        g => g.Aggregate(
            (a, b) => (a.Doc.SourceId  <= b.Doc.SourceId) ? a : b
        )
    );

Just to ensure that it works, I tested it with a simple foreach:

foreach (var a in selectedFew) Console.WriteLine(a);
//The result will be:
//{ Doc = { Title = ABC, SourceId = 0 }, Index = 0 }
//{ Doc = { Title = 123, SourceId = 5 }, Index = 4 }

I'm not sure that's what you wanted. If not, please comment the answer and I can fix the answer. I hope this helps.

Obs.: All the classes used in my tests were anonymous. So, you don't really need to define a DocumentWithIndex type. Actually, I haven't even declared a Document class.

默嘫て 2024-08-05 06:42:52

基于方法的语法:

var selectedFew = docs.GroupBy(doc => new {doc.Title, doc.SourceId}, doc => doc)
                      .SelectMany((grouping) => grouping.Select((doc, index) => new {doc, index}))
                              .GroupBy(anon => new {anon.doc.Title, anon.index})
                              .Select(grouping => grouping.Aggregate((a, b) =>    a.doc.SourceId <= b.doc.SourceId ? a : b));

您认为上面的语法是等效的基于方法的语法吗?

Method Based Syntax:

var selectedFew = docs.GroupBy(doc => new {doc.Title, doc.SourceId}, doc => doc)
                      .SelectMany((grouping) => grouping.Select((doc, index) => new {doc, index}))
                              .GroupBy(anon => new {anon.doc.Title, anon.index})
                              .Select(grouping => grouping.Aggregate((a, b) =>    a.doc.SourceId <= b.doc.SourceId ? a : b));

Would you say the above is the equivalent Method based syntax?

梦忆晨望 2024-08-05 06:42:52

我实现了一个扩展方法。 它支持多个字段分区以及多个订单条件。

public static IEnumerable<TResult> Partition<TSource, TKey, TResult>(
    this IEnumerable<TSource> source, 
    Func<TSource, TKey> keySelector,
    Func<IEnumerable<TSource>, IOrderedEnumerable<TSource>> sorter,
    Func<TSource, int, TResult> selector)
{
    AssertUtilities.ArgumentNotNull(source, "source");

    return source
        .GroupBy(keySelector)
        .Select(arg => sorter(arg).Select(selector))
        .SelectMany(arg => arg);
}

用法:

var documents = new[] 
{
    new { Title = "Title1", SourceId = 1 },
    new { Title = "Title1", SourceId = 2 },
    new { Title = "Title2", SourceId = 15 },
    new { Title = "Title2", SourceId = 14 },
    new { Title = "Title3", SourceId = 100 }
};

var result = documents
    .Partition(
        arg => arg.Title,  // partition by
        arg => arg.OrderBy(x => x.SourceId), // order by
        (arg, rowNumber) => new { RowNumber = rowNumber, Document = arg }) // select
    .Where(arg => arg.RowNumber == 0)
    .Select(arg => arg.Document)
    .ToList();

结果:

{ Title = "Title1", SourceId = 1 },
{ Title = "Title2", SourceId = 14 },
{ Title = "Title3", SourceId = 100 }

I implemented an extension method. It supports multiple partition by fields as well as multiple order conditions.

public static IEnumerable<TResult> Partition<TSource, TKey, TResult>(
    this IEnumerable<TSource> source, 
    Func<TSource, TKey> keySelector,
    Func<IEnumerable<TSource>, IOrderedEnumerable<TSource>> sorter,
    Func<TSource, int, TResult> selector)
{
    AssertUtilities.ArgumentNotNull(source, "source");

    return source
        .GroupBy(keySelector)
        .Select(arg => sorter(arg).Select(selector))
        .SelectMany(arg => arg);
}

Usage:

var documents = new[] 
{
    new { Title = "Title1", SourceId = 1 },
    new { Title = "Title1", SourceId = 2 },
    new { Title = "Title2", SourceId = 15 },
    new { Title = "Title2", SourceId = 14 },
    new { Title = "Title3", SourceId = 100 }
};

var result = documents
    .Partition(
        arg => arg.Title,  // partition by
        arg => arg.OrderBy(x => x.SourceId), // order by
        (arg, rowNumber) => new { RowNumber = rowNumber, Document = arg }) // select
    .Where(arg => arg.RowNumber == 0)
    .Select(arg => arg.Document)
    .ToList();

Result:

{ Title = "Title1", SourceId = 1 },
{ Title = "Title2", SourceId = 14 },
{ Title = "Title3", SourceId = 100 }
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文