我正在使用 LINQ to Objects,想知道是否可以通过使用我拥有的索引来提高查询性能。最好用一个例子来解释这一点。想象一个简单的类型...
public class Person
{
public int Age;
public string FirstName;
public string LastName;
}
以及我将针对它进行的一个简单查询...
List<Person> people = new List<Person>();
// 'people' populated with 50,000 instances...
var x = from t in people
where t.Age > 18 && t.Age < 21
select t;
如果我正确理解 LINQ to Objects,那么Where 扩展方法的实现将枚举 people 集合中的所有 50,000 个实例,以便找到该类型的 100 个实例。实际上匹配。碰巧我已经有了一个按年龄排序的人员集合的索引。像这样...
SortedList<int, Person> ageSorted = new SortedList<int, Person>();
显然,如果我可以获得在何处使用 SortedList,这样它就不再需要枚举所有 50,000 个实例,而是查找 100 个匹配条目的范围,从而节省时间,这是有意义的。
是否可以扩展 LINQ to Objects 来实现我的情况?这已经可能了,但我缺少这项技术吗?
I am using LINQ to Objects and wonder if it is possible to improve the performance of my queries by making use of an index that I have. This is best explained with an example. Imagine a simple type...
public class Person
{
public int Age;
public string FirstName;
public string LastName;
}
And a simple query I would make against it...
List<Person> people = new List<Person>();
// 'people' populated with 50,000 instances...
var x = from t in people
where t.Age > 18 && t.Age < 21
select t;
If I understand LINQ to Objects correctly then the implementation of the Where extension method will enumerate all 50,000 instances in the people collection in order to find the 100 that actually match. As it happens I already have an index of the people collection that is sorted by Age. Like this...
SortedList<int, Person> ageSorted = new SortedList<int, Person>();
Clearly it would make sense if I could get the Where to use the SortedList so that it no longer has to enumerate all 50,000 instances, instead finding the range of 100 matching entries and so saving time.
Is it possible to extend LINQ to Objects to enable my situation? Is it already possible but I am missing the technique?
发布评论
评论(4)
我相信已经有一个项目可以做到这一点 - i4o。我不能说我自己使用过它,但这听起来像是您想要的东西......您可能需要稍微调整一下现有的代码,但它确实值得一看。
如果这没有帮助,您至少可以在
SortedList
上编写自己的扩展方法。您可能无法轻松使用实际的where
子句,但您可以使用自己的方法获取最小值和最大值。您可能还希望将它们应用于IList
,您断言您已经对值进行了适当的排序(根据某些比较器)。例如(完全未经测试):(
如果您只有
List
而不是IList
,您可以使用List.BinarySearch
,尽管您需要构建自定义IComparer
。)另外,请查看
SortedSet
。There's already a project which I believe does exactly that - i4o. I can't say I've used it myself, but it sounds like the kind of thing you want... you may need to juggle your existing code a bit, but it's certainly worth looking at.
If that doesn't help, you could at least write your own extension methods on
SortedList<TKey, TValue>
. You probably wouldn't be able to easily use your actualwhere
clause, but you could use your own methods taking a minimum and a maximum value. You might also want to make them apply toIList<T>
where you assert that you've already sorted the values appropriately (according to some comparer).For example (completely untested):
(If you only have
List<T>
instead ofIList<T>
, you could useList<T>.BinarySearch
, although you'd need to build a customIComparer<T>
.)Also, have a look at
SortedSet<T>
in .NET 4.你是对的,你编写的查询将枚举整个列表,因为显然 LINQ 无法假设有关你的数据的任何信息。
如果您有一个 SortedList,您可以使用 SkipWhile/TakeWhile linq 方法来利用它:
编辑
@Davy8 当然是正确的,最坏的情况下它仍然具有相同的性能。请参阅其他答案,了解更快速找到第一个值的方法。
如果您需要针对不同的年龄范围多次执行此操作,那么您也可以通过按年龄分组来加快速度:
You're right that the query you wrote will enumerate the whole list as obviously LINQ can't assume anything about your data.
If you have a SortedList, you can exploit that using the SkipWhile/TakeWhile linq methods:
EDIT
@Davy8 is right of course that worst case this still has the same performance. See the other answers for a way to more quickly find the first value.
If you need to do this operation more than once for different age ranges then you can probably also speed it up by grouping on age:
LINQ 查询语法实际上使用与签名匹配的任何名为
Where
的扩展方法,因此您始终可以编写自己的扩展方法来按照您想要的方式处理特定类型。...
然后你可以应用任何你想要的逻辑。我相信正常的方法重载规则适用于确定将调用哪个
Where
扩展方法。The LINQ query syntax actually uses any extension method named
Where
that matches the signature, so you can always write your own that handles your specific type the way you want....
Then you can apply any logic you want. I believe the normal method overload rules apply for determining which
Where
extension method would be called.在预排序容器中,通过快速找到第一个元素来实现效率。找到第一个元素后,只需线性检索以下元素,直到找到范围的末尾。
假设您的
SortedList
按Person.Age
排序,您可以使用SortedList.IndexOfKey
找到该范围的第一个元素,它是一个 二分搜索算法;因此,该方法是一个 O(log n) 操作。(我认为您无法更改代码,因此
Enumerable.Where
突然变得更加智能,并通过使用二分搜索找到范围开始。)-- - 编辑 ---
实际上,您真正需要的是
List.BinarySearch
或Array.BinarySearch
。如果不存在精确匹配,SortedList.IndexOfKey
不会让您获取最接近匹配的索引。或者您可以自己实现二分搜索。In a pre-sorted container, the efficiency is achieved by finding the first element quickly. Once you find the first element, just linearly retrieve the following elements until you find the end of your range.
Assuming your
SortedList
is sorted byPerson.Age
, you can find the first element of the range usingSortedList.IndexOfKey
, which is a binary search algorithm; therefore, this method is an O(log n) operation.(I don't think you can change your code so the
Enumerable.Where
suddenly becomes more intelligent and finds the range start by using binary search.)--- EDIT ---
Actually, what you really need is
List.BinarySearch
orArray.BinarySearch
. TheSortedList.IndexOfKey
won't let you get the index of the closest match in case exact match does not exist. Or you can just implement the binary search yourself.