Linq 和二分搜索 - 改进这个缓慢的Where 语句？

发布于 2024-08-02 09:34:46 字数 512 浏览 5 评论 0原文

我有两个藏品，每个藏品大约有 40,000 件。

列表 2 中的元素通过外键链接到列表 1 的元素。

对于列表一的每个元素，我想在列表二中找到相应的元素。

像这样的事情：

foreach(var item in list1)
{
  var match = list2.Where(child => child.ID == item.ChildID).FirstOrDefault();
  item.Child = match;
}

这可行，但速度慢得要命。

现在，list1 和 list 2 都根据数据库中的这些键进行排序。因此 list1 按 ChildID 排序，list2 按 ID 排序（相同的值）。

我认为二分搜索会大大加快速度，但我在某处读到 Linq 会为Where 子句中的列表选择最合适的策略。也许我需要显式转换为排序列表？或者也许我需要使用比较器实现自定义二分搜索算法？

任何见解表示赞赏。

谢谢。

原文

I've got two collections, each containing about 40,000 items.

The elements in list 2 are linked to the elements of list one via a foreign key.

For each element of list one, I want to find the corresponding element in list two.

Something like this:

foreach(var item in list1)
{
  var match = list2.Where(child => child.ID == item.ChildID).FirstOrDefault();
  item.Child = match;
}

This works but it's slow as hell.

Now, both list1 and list 2 are sorted by these keys from the database. So list1 is ordered by ChildID and list2 is ordered by ID (same value).

I think a Binary search would dramatically speed this up, but I read somewhere that Linq would choose the most appropriate strategy for the list in the Where clause. Maybe I need to explicitly cast to a sorted list? Or maybe I need to implement a custom Binary Search algorithm w/ a comparer?

Any insights are appreciated.

Thanks.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

十年九夏 2024-08-09 09:34:46

为什么不使用联接？

var query = 
   from a in list1
   join b in list2 on a.ChildID equals b.ID
   select new {Item1 = a, Item2 = b};

foreach(var item in query)
{
   item.Item1.Child = item.Item2;
}

Why not use a join?

var query = 
   from a in list1
   join b in list2 on a.ChildID equals b.ID
   select new {Item1 = a, Item2 = b};

foreach(var item in query)
{
   item.Item1.Child = item.Item2;
}

回复收藏 0 原文

-黛色若梦 2024-08-09 09:34:46

这个怎么样：

        var joined = list1.Join(list2, x => x.ChildID, x => x.ID, (x, y) => new { x, y });

        foreach (var j in joined)
        {
            j.x.Child = j.y;
        }

？

What about this:

        var joined = list1.Join(list2, x => x.ChildID, x => x.ID, (x, y) => new { x, y });

        foreach (var j in joined)
        {
            j.x.Child = j.y;
        }

回复收藏 0 原文

烟雨扶苏 2024-08-09 09:34:46

我以前遇到过这个问题，与基于 DB 的搜索相比，基于 LINQ 的搜索速度非常慢，因为它没有使用任何索引。

您是否考虑过使用字典而不是列表？

您可以实现一个字典，然后不使用Where，而是使用 ContainsKey 如果确实存在，请使用索引访问器</a>。

示例代码：

Dictionary<int, Child> list2 = ...;

...

foreach(var item in list1)
{
  if (list2.ContainsKey(item.ChildID))
    item.Child = list2[item.ChildID];
}

使用索引进行访问比搜索列表要快得多，但代价是索引需要额外的内存。

I had this problem before, LINQ-based searching is extremely slow compared to DB-based because it's not utilizing any index.

Have you considered using a Dictionary instead of List?

You can implement a Dictionary and then instead of using Where, you can use ContainsKey and if it does exist, get the value using index accessor.

Sample code:

Dictionary<int, Child> list2 = ...;

...

foreach(var item in list1)
{
  if (list2.ContainsKey(item.ChildID))
    item.Child = list2[item.ChildID];
}

Access using index would be significantly faster than searching a list, on the cost of extra memory required for the index.

回复收藏 0 原文

心清如水 2024-08-09 09:34:46

由于两个列表都按相同的值排序，因此您可以并行循环它们：

int index1 = 0, index2 = 0;
while (index1 < list1.Count && index2 < list2.Count) {
   while (index1 < list1.Count && list1[index1].ChildId < list2[index2].Id) index1++;
   if (index1 < list1.Count) {
      while (index2 < list2.Count && list2[index2].Id < list1[index1].ChildId) index2++;
      if (index2 < list2.Count && list1[index1].ChildId == list2[index2].Id) {
         list1[index].Child = list2[index2];
         index1++;
         index2++;
      }
   }
}

或者：

int index1 = 0, index2 = 0;
while (index1 < list1.Count && index2 < list2.Count) {
   if (list1[index1].ChildId == list2[index2].Id) {
      list1[index].Child = list2[index2];
      index1++;
      index2++;
   } else {
      if (list1[index1].ChildId < list2[index2].Id) {
         index1++;
      } else {
         index2++;
      }
   }
}

另一种有效的替代方案，但不利用列表的顺序，是通过放置其中一个列表来创建索引在字典中：

Dictionary<int, TypeOfChild> index = new Dictionary<int, TypeOfChild>();
foreach (TypeOfChild child in list2) {
   index.Add(child.Id, child);
}
foreach (TypeOfParent parent in list1) {
   TypeOfChild child;
   if (index.TryGetValue(parent.ChildId, out child) {
      parent.Child = child;
   }
}

As both lists are sorted on the same value, you can just loop through them in parallel:

int index1 = 0, index2 = 0;
while (index1 < list1.Count && index2 < list2.Count) {
   while (index1 < list1.Count && list1[index1].ChildId < list2[index2].Id) index1++;
   if (index1 < list1.Count) {
      while (index2 < list2.Count && list2[index2].Id < list1[index1].ChildId) index2++;
      if (index2 < list2.Count && list1[index1].ChildId == list2[index2].Id) {
         list1[index].Child = list2[index2];
         index1++;
         index2++;
      }
   }
}

or:

int index1 = 0, index2 = 0;
while (index1 < list1.Count && index2 < list2.Count) {
   if (list1[index1].ChildId == list2[index2].Id) {
      list1[index].Child = list2[index2];
      index1++;
      index2++;
   } else {
      if (list1[index1].ChildId < list2[index2].Id) {
         index1++;
      } else {
         index2++;
      }
   }
}

Another efficient alternative, but which doesn't take advantage of the order of the lists, is to create an index by putting one of the lists in a dictionary:

Dictionary<int, TypeOfChild> index = new Dictionary<int, TypeOfChild>();
foreach (TypeOfChild child in list2) {
   index.Add(child.Id, child);
}
foreach (TypeOfParent parent in list1) {
   TypeOfChild child;
   if (index.TryGetValue(parent.ChildId, out child) {
      parent.Child = child;
   }
}

回复收藏 0 原文

油饼 2024-08-09 09:34:46

我忍不住回答这个问题:-)

你的代码变慢的主要原因是你的项目将被多次读取。速度的艺术是：只在需要时才读取内存，如果需要读取，则尽可能少地读取。

下面是一个示例：

代码

public class Item
{    
   private int _id;
   private List<ItemDetails> _detailItems = new List<ItemDetails>();

   public Item(int id)<br>
   {
       _id = id;
   }

   public void AddItemDetail(ItemDetails itemDetail)
   {
       _detailItems.Add(itemDetail);
   }

   public int Id
   {
       get { return _id; }
   }
   public ReadOnlyCollection<ItemDetails> DetailItems
   {
       get { return _detailItems.AsReadOnly(); }
   }
}


public class ItemDetails
{
    private int _parentId;

    public ItemDetails(int parentId)
    {
        _parentId = parentId;
    }

    public int ParentId
    {
        get { return _parentId; }
    }
}

示例代码：

主要目标是扫描列表并比较当前索引上的 item 和 itemdetail。当parentid等于它的parents id时。将其添加到列表中并继续下一个详细信息。如果不同，则继续寻找下一位父母。

// for performance tests..
DateTime startDateTime;

// create 2 lists  (master/child list)
List<Item> itemList = new List<Item>();
List<ItemDetails> itemDetailList = new List<ItemDetails>();

Debug.WriteLine("# Adding items");
startDateTime = DateTime.Now;

// add items (sorted)
for (int i = 0; i < 400000; i++)
    itemList.Add(new Item(i));

// show how long it took
Debug.WriteLine("Total milliseconds: " + (DateTime.Now - startDateTime).TotalMilliseconds.ToString("0") + "ms" );

// adding some random details (also sorted)
Debug.WriteLine("# Adding itemdetails");
Random rnd = new Random(DateTime.Now.Millisecond);

startDateTime = DateTime.Now;

int index = 0;
for (int i = 0; i < 800000; i++)
{
    // when the random number is bigger than 2, index will be increased by 1
    index += rnd.Next(5) > 2 ? 1 : 0;
    itemDetailList.Add(new ItemDetails(index));
}
Debug.WriteLine("Total milliseconds: " + (DateTime.Now - startDateTime).TotalMilliseconds.ToString("0") + "ms");

// show how many items the lists contains
Debug.WriteLine("ItemList Count: " + itemList.Count());
Debug.WriteLine("ItemDetailList Count: " + itemDetailList.Count());

// matching items
Debug.WriteLine("# Matching items");
startDateTime = DateTime.Now;

int itemIndex = 0;
int itemDetailIndex = 0;

int itemMaxIndex = itemList.Count;
int itemDetailMaxIndex = itemDetailList.Count;

// while we didn't reach any end of the lists, continue...
while ((itemIndex < itemMaxIndex) && (itemDetailIndex < itemDetailMaxIndex))
{
    // if the detail.parentid matches the item.id. add it to the list.
    if (itemList[itemIndex].Id == itemDetailList[itemDetailIndex].ParentId)
    {
        itemList[itemIndex].AddItemDetail(itemDetailList[itemDetailIndex]);
        // increase the detail index.
        itemDetailIndex++;
    }
    else
        // the detail.parentid didn't matches the item.id so check the next 1
        itemIndex++;
}

Debug.WriteLine("Total milliseconds: " + (DateTime.Now - startDateTime).TotalMilliseconds.ToString("0") + "ms");

结果

我多花了 10 倍的项目才能看到更好的结果：

添加项目：
总毫秒数：140ms
添加项目详细信息：
总毫秒数：203ms
项目列表数量：400000
项目详细列表计数：800000
配套项目：
总毫秒数：265ms

打字速度很快，而且还可以更简洁。所以我希望你能读一下。玩它。

问候，
杰罗恩.

I couldn't resist to answer this :-)

The main reason your code gets slow, is that your items will be readed many many times. The art of speed is: Read memory only if you need to and if you need to read it, read it as less as possible.

Here is an example:

code

public class Item
{    
   private int _id;
   private List<ItemDetails> _detailItems = new List<ItemDetails>();

   public Item(int id)<br>
   {
       _id = id;
   }

   public void AddItemDetail(ItemDetails itemDetail)
   {
       _detailItems.Add(itemDetail);
   }

   public int Id
   {
       get { return _id; }
   }
   public ReadOnlyCollection<ItemDetails> DetailItems
   {
       get { return _detailItems.AsReadOnly(); }
   }
}


public class ItemDetails
{
    private int _parentId;

    public ItemDetails(int parentId)
    {
        _parentId = parentId;
    }

    public int ParentId
    {
        get { return _parentId; }
    }
}

example code:

The main goal is to scan the lists and compare the item and itemdetail on the current indexes. When the parentid is equal to it's parents id. add it to the list and go on to the next detail. If it is different go on to the next parent.

// for performance tests..
DateTime startDateTime;

// create 2 lists  (master/child list)
List<Item> itemList = new List<Item>();
List<ItemDetails> itemDetailList = new List<ItemDetails>();

Debug.WriteLine("# Adding items");
startDateTime = DateTime.Now;

// add items (sorted)
for (int i = 0; i < 400000; i++)
    itemList.Add(new Item(i));

// show how long it took
Debug.WriteLine("Total milliseconds: " + (DateTime.Now - startDateTime).TotalMilliseconds.ToString("0") + "ms" );

// adding some random details (also sorted)
Debug.WriteLine("# Adding itemdetails");
Random rnd = new Random(DateTime.Now.Millisecond);

startDateTime = DateTime.Now;

int index = 0;
for (int i = 0; i < 800000; i++)
{
    // when the random number is bigger than 2, index will be increased by 1
    index += rnd.Next(5) > 2 ? 1 : 0;
    itemDetailList.Add(new ItemDetails(index));
}
Debug.WriteLine("Total milliseconds: " + (DateTime.Now - startDateTime).TotalMilliseconds.ToString("0") + "ms");

// show how many items the lists contains
Debug.WriteLine("ItemList Count: " + itemList.Count());
Debug.WriteLine("ItemDetailList Count: " + itemDetailList.Count());

// matching items
Debug.WriteLine("# Matching items");
startDateTime = DateTime.Now;

int itemIndex = 0;
int itemDetailIndex = 0;

int itemMaxIndex = itemList.Count;
int itemDetailMaxIndex = itemDetailList.Count;

// while we didn't reach any end of the lists, continue...
while ((itemIndex < itemMaxIndex) && (itemDetailIndex < itemDetailMaxIndex))
{
    // if the detail.parentid matches the item.id. add it to the list.
    if (itemList[itemIndex].Id == itemDetailList[itemDetailIndex].ParentId)
    {
        itemList[itemIndex].AddItemDetail(itemDetailList[itemDetailIndex]);
        // increase the detail index.
        itemDetailIndex++;
    }
    else
        // the detail.parentid didn't matches the item.id so check the next 1
        itemIndex++;
}

Debug.WriteLine("Total milliseconds: " + (DateTime.Now - startDateTime).TotalMilliseconds.ToString("0") + "ms");

results

I took 10 times more items to see a better result:

Adding items:
Total milliseconds: 140ms
Adding itemdetails:
Total milliseconds: 203ms
ItemList Count: 400000
ItemDetailList Count: 800000
Matching items:
Total milliseconds: 265ms

This was typed fast and could be cleaner. So i hope you can read it. Play with it.

Greetz,
Jeroen.

回复收藏 0 原文

一页 2024-08-09 09:34:46

不确定这是否真的会加快速度，但您可以将谓词移动到 FirstOrDefault() 子句中并完全删除 where 。

item.Child = list2.FirstOrDefault(child => child.ID == item.ChildID)

它实际上可能没有帮助，因为这可能只是隐式调用Where()。

但这里有一个问题，该方法实际上很慢还是仅在编译后第一次运行时才慢？查看对此的讨论发布。

Not sure if this would actually speed it up any, but you can move the predicate into the FirstOrDefault() clause and get rid of where entirely.

item.Child = list2.FirstOrDefault(child => child.ID == item.ChildID)

It might not actually help since this might implicitly just call Where().

Here's a question though, is the method actually slow or is it slow only the first time it is run after a compilation? Check out the discussion on this post.

回复收藏 0 原文

~没有更多了~