Memory-optimized OrderBy and Take?

Posted 2024-11-08 22:05:58


I have 9 GB of data, and I want only 10 rows. When I do:

 data.OrderBy(datum => datum.Column1)
     .Take(10)
     .ToArray();

I get an OutOfMemoryException. I would like an OrderByAndTake method optimized for lower memory consumption. It's easy to write, but I imagine someone already has. Where can I find it?

Edit: It's LINQ to Objects. The data comes from a file. Each row can be discarded if its Column1 value is smaller than the current list of the 10 biggest values.

Comments (4)

够运 2024-11-15 22:05:58


I'm assuming you're doing this in Linq to Objects. You could do something like...

var best = data
    .Aggregate(new List<T>(), (soFar, current) => soFar
                                                 .Concat(new [] { current })
                                                 .OrderBy(datum => datum.Column1)
                                                 .Take(10)
                                                 .ToList());

In this way, not all the items need to be kept in a new sorted collection, only the best 10 you're interested in.

This is the least-code way. Since you know the soFar list is sorted, testing where (and whether) to insert current could be optimized. I didn't feel like doing ALL the work for you. ;-)

PS: Replace T with whatever your type is.

EDIT: Thinking about it, the most efficient way would actually be a plain old foreach that compares each item to the running list of best 10.
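A minimal sketch of that "plain old foreach" idea, assuming a key selector rather than a hard-coded Column1 (the names `TopN` and `TakeSmallest` are illustrative, not from the answer): keep a sorted running list of the smallest items seen so far, skip anything that cannot make the cut, and binary-search for the insertion point otherwise.

```csharp
using System;
using System.Collections.Generic;

static class TopN
{
    // Returns the 'count' items with the smallest keys, in ascending key order.
    public static List<T> TakeSmallest<T, TKey>(
        IEnumerable<T> source, int count, Func<T, TKey> keySelector)
        where TKey : IComparable<TKey>
    {
        var best = new List<T>(count + 1); // kept sorted ascending by key
        foreach (var item in source)
        {
            TKey key = keySelector(item);

            // Fast path: list is full and this item is no better than the worst kept.
            if (best.Count == count &&
                key.CompareTo(keySelector(best[best.Count - 1])) >= 0)
                continue;

            // Binary-search for the insertion point, then trim back to 'count'.
            int lo = 0, hi = best.Count;
            while (lo < hi)
            {
                int mid = (lo + hi) / 2;
                if (keySelector(best[mid]).CompareTo(key) <= 0) lo = mid + 1;
                else hi = mid;
            }
            best.Insert(lo, item);
            if (best.Count > count)
                best.RemoveAt(count);
        }
        return best;
    }
}
```

Memory stays at O(count) regardless of input size, so the 9 GB file can be streamed through, e.g. `TopN.TakeSmallest(data, 10, d => d.Column1)`.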

娇妻 2024-11-15 22:05:58


It figures: OrderBy is a sort, and sorting requires buffering all the elements (deferred execution effectively ends there).

It ought to work efficiently when data is an IQueryable; then it's up to the database.


    // just 4 fun
    public static IEnumerable<T> TakeDistinctMin<T, TKey>(this IEnumerable<T> @this,
        int n, Func<T, TKey> selector)
        where TKey : IComparable<TKey>
    {
        var tops = new SortedList<TKey, T>(n + 1);

        foreach (var item in @this)
        {
            TKey k = selector(item);

            if (tops.ContainsKey(k))
                continue;

            if (tops.Count < n)
            {
                tops.Add(k, item);
            }
            else if (k.CompareTo(tops.Keys[tops.Count - 1]) < 0)
            {
                tops.Add(k, item);
                tops.RemoveAt(n);
            }
        }

        return tops.Values;
    }
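For readers unfamiliar with SortedList, a brief illustration (with made-up values) of why `tops.Keys[tops.Count - 1]` is always the current largest key and `RemoveAt` can evict it:

```csharp
using System;
using System.Collections.Generic;

// SortedList keeps its entries ordered by key, so the last index is the max.
var tops = new SortedList<int, string>();
tops.Add(7, "seven");
tops.Add(3, "three");
tops.Add(9, "nine");

Console.WriteLine(tops.Keys[tops.Count - 1]);   // largest key: 9
tops.RemoveAt(tops.Count - 1);                  // evict the largest entry
Console.WriteLine(string.Join(",", tops.Keys)); // remaining keys: 3,7
```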
最舍不得你 2024-11-15 22:05:58


To order a set of unordered objects you have to look at all of them, no?

I don't see how you could avoid scanning all 9 GB of data to get the first 10 ordered in a certain way, unless the data was already ordered that way, or there were indexes or other ancillary data structures that could be utilized.

Could you provide a bit more background on your question? Are you querying a database using LINQ to SQL, Entity Framework, or some other O/RM?

比忠 2024-11-15 22:05:58


You can use something like this together with a projection comparer:

public static IEnumerable<T> OrderAndTake<T>(this IEnumerable<T> seq, int count, IComparer<T> comp)
{
    var resultSet = new SortedSet<T>(comp);
    foreach (T elem in seq)
    {
        resultSet.Add(elem);
        if (resultSet.Count > count)
            resultSet.Remove(resultSet.Max);
    }
    return resultSet.Select(x => x);
}

Runtime should be O(seq.Count() * log(count)) and space O(min(count, seq.Count())), since the set never holds more than count + 1 elements.

One issue is that it will break if there are two elements for which comp.Compare(a, b) == 0, since the set doesn't allow duplicate entries.
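One way around that duplicate-key issue, sketched below under the assumption that you control how elements are fed in: pair each element with a unique sequence number and have the projection comparer break ties on it, so Compare never returns 0 for two distinct entries (the `ProjectionComparer` name and tuple wrapping are illustrative, not part of the answer above).

```csharp
using System;
using System.Collections.Generic;

var data = new[] { 5, 1, 5, 2 };
var set = new SortedSet<(int Item, long Seq)>(new ProjectionComparer<int, int>(x => x));
long seq = 0;
foreach (var x in data)
{
    set.Add((x, seq++));
    if (set.Count > 2)
        set.Remove(set.Max); // evict the current largest, keeping the 2 smallest
}
foreach (var (item, _) in set)
    Console.WriteLine(item); // the duplicate 5 no longer collides

// Compares by a projected key; ties are broken by the unique sequence number,
// so two distinct entries never compare equal.
class ProjectionComparer<T, TKey> : IComparer<(T Item, long Seq)>
    where TKey : IComparable<TKey>
{
    private readonly Func<T, TKey> _selector;
    public ProjectionComparer(Func<T, TKey> selector) => _selector = selector;

    public int Compare((T Item, long Seq) x, (T Item, long Seq) y)
    {
        int c = _selector(x.Item).CompareTo(_selector(y.Item));
        return c != 0 ? c : x.Seq.CompareTo(y.Seq);
    }
}
```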
