如何对 LINQ to Objects 查询进行分区?
这是一个资源分配问题。我的目标是运行查询来获取任何时间段的最高优先级班次。
数据集非常大。对于此示例,假设 1000 家公司每个班次有 100 个班次(尽管实际数据集更大)。它们都加载到内存中,我需要对它们运行单个 LINQ to Objects 查询:
var topShifts =
(from s in shifts
where (from s2 in shifts
where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
orderby s2.Priority
select s2).First().Equals(s)
select s).ToList();
问题是,如果不进行优化,LINQ to Objects 将比较两个集合中的每个对象,对所有 1,000 个对象进行交叉联接100 与 1,000 x 100,相当于 100 亿 (10,000,000,000) 次比较。我想要的是仅比较每个公司内的对象(就好像公司在 SQL 表中索引一样)。这应该会产生 1000 组 100 x 100 对象,总共进行 1000 万 (10,000,000) 次比较。随着公司数量的增长,后者将呈线性而非指数式扩展。
像 I4o 这样的技术将允许我做这样的事情,但不幸的是,我没有使用的奢侈我执行此查询的环境中的自定义集合。另外,我只希望在任何给定数据集上运行此查询一次,因此持久索引的价值是有限的。我期望使用一种扩展方法,该方法将按公司对数据进行分组,然后在每个组上运行表达式。
完整示例代码:
public struct Shift
{
public static long Iterations;
private int companyId;
public int CompanyId
{
get { Iterations++; return companyId; }
set { companyId = value; }
}
public int Id;
public int TimeSlot;
public int Priority;
}
class Program
{
static void Main(string[] args)
{
const int Companies = 1000;
const int Shifts = 100;
Console.WriteLine(string.Format("{0} Companies x {1} Shifts", Companies, Shifts));
var timer = Stopwatch.StartNew();
Console.WriteLine("Populating data");
var shifts = new List<Shift>();
for (int companyId = 0; companyId < Companies; companyId++)
{
for (int shiftId = 0; shiftId < Shifts; shiftId++)
{
shifts.Add(new Shift() { CompanyId = companyId, Id = shiftId, TimeSlot = shiftId / 3, Priority = shiftId % 5 });
}
}
Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
timer.Restart();
Console.WriteLine("Computing Top Shifts");
var topShifts =
(from s in shifts
where (from s2 in shifts
where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
orderby s2.Priority
select s2).First().Equals(s)
select s).ToList();
Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
timer.Restart();
Console.WriteLine("\nShifts:");
foreach (var shift in shifts.Take(20))
{
Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
}
Console.WriteLine("\nTop Shifts:");
foreach (var shift in topShifts.Take(10))
{
Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
}
Console.WriteLine(string.Format("\nTotal Comparisons: {0:n}", Shift.Iterations/2));
Console.WriteLine("Any key to continue");
Console.ReadKey();
}
}
示例输出:
1000 家公司 x 100 班次
填充数据
10.00 毫秒内完成
计算最高班次
在 520,721.00ms 内完成
班次:
C 0 ID 0 T 0 P0
C 0 ID 1 T 0 P1
C 0 Id 2 T 0 P2
C 0 ID 3 T 1 P3
C 0 ID 4 T 1 P4
C 0 ID 5 T 1 P0
C 0 Id 6 T 2 P1
C 0 ID 7 T 2 P2
C 0 Id 8 T 2 P3
C 0 ID 9 T 3 P4
C 0 ID 10 T 3 P0
C 0 ID 11 T 3 P1
C 0 ID 12 T 4 P2
C 0 ID 13 T 4 P3
C 0 ID 14 T 4 P4
C 0 ID 15 T 5 P0
C 0 ID 16 T 5 P1
C 0 ID 17 T 5 P2
C 0 ID 18 T 6 P3
C 0 ID 19 T 6 P4
最高班次:
C 0 ID 0 T 0 P0
C 0 ID 5 T 1 P0
C 0 Id 6 T 2 P1
C 0 ID 10 T 3 P0
C 0 ID 12 T 4 P2
C 0 ID 15 T 5 P0
C 0 ID 20 T 6 P0
C 0 ID 21 T 7 P1
C 0 ID 25 T 8 P0
C 0 ID 27 T 9 P2
比较总数:10,000,000,015.00
按任意键继续
问题:
- 如何对查询进行分区(同时仍作为单个 LinQ 查询执行)以便将比较从 100 亿减少到 1000 万?
- 有没有比子查询更有效的解决问题的方法?
This is a resource-allocation problem. My goal is to run a query to fetch the top-priority shift for any time-slot.
The dataset is very large. For this example, let’s say there are 100 shifts each for 1000 companies (though the real dataset is even larger). They are all loaded into memory, and I need to run a single LINQ to Objects query against them:
var topShifts =
(from s in shifts
where (from s2 in shifts
where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
orderby s2.Priority
select s2).First().Equals(s)
select s).ToList();
The problem is that without optimization, LINQ to Objects will compare each and every object in both sets, doing a cross-join of all 1,000 x 100 against 1,000 x 100, which amounts to 10 billion (10,000,000,000) comparisons. What I want is to compare only objects within each company (as if Company were indexed in a SQL table). This should result in 1000 sets of 100 x 100 objects for a total of 10 million (10,000,000) comparisons. The later would scale linearly rather than exponentially as the number of companies grows.
Technologies like I4o would allow me to do something like this, but unfortunately, I don’t have the luxury of using a custom collection in the environment in which I’m executing this query. Also, I only expect to run this query once on any given dataset, so the value of a persistent index is limited. I’m expecting to use an extension method which would group the data by company, then run the expression on each group.
Full Sample code:
public struct Shift
{
public static long Iterations;
private int companyId;
public int CompanyId
{
get { Iterations++; return companyId; }
set { companyId = value; }
}
public int Id;
public int TimeSlot;
public int Priority;
}
class Program
{
static void Main(string[] args)
{
const int Companies = 1000;
const int Shifts = 100;
Console.WriteLine(string.Format("{0} Companies x {1} Shifts", Companies, Shifts));
var timer = Stopwatch.StartNew();
Console.WriteLine("Populating data");
var shifts = new List<Shift>();
for (int companyId = 0; companyId < Companies; companyId++)
{
for (int shiftId = 0; shiftId < Shifts; shiftId++)
{
shifts.Add(new Shift() { CompanyId = companyId, Id = shiftId, TimeSlot = shiftId / 3, Priority = shiftId % 5 });
}
}
Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
timer.Restart();
Console.WriteLine("Computing Top Shifts");
var topShifts =
(from s in shifts
where (from s2 in shifts
where s2.CompanyId == s.CompanyId && s.TimeSlot == s2.TimeSlot
orderby s2.Priority
select s2).First().Equals(s)
select s).ToList();
Console.WriteLine(string.Format("Completed in {0:n}ms", timer.ElapsedMilliseconds));
timer.Restart();
Console.WriteLine("\nShifts:");
foreach (var shift in shifts.Take(20))
{
Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
}
Console.WriteLine("\nTop Shifts:");
foreach (var shift in topShifts.Take(10))
{
Console.WriteLine(string.Format("C {0} Id {1} T {2} P{3}", shift.CompanyId, shift.Id, shift.TimeSlot, shift.Priority));
}
Console.WriteLine(string.Format("\nTotal Comparisons: {0:n}", Shift.Iterations/2));
Console.WriteLine("Any key to continue");
Console.ReadKey();
}
}
Sample output:
1000 Companies x 100 Shifts
Populating data
Completed in 10.00ms
Computing Top Shifts
Completed in 520,721.00ms
Shifts:
C 0 Id 0 T 0 P0
C 0 Id 1 T 0 P1
C 0 Id 2 T 0 P2
C 0 Id 3 T 1 P3
C 0 Id 4 T 1 P4
C 0 Id 5 T 1 P0
C 0 Id 6 T 2 P1
C 0 Id 7 T 2 P2
C 0 Id 8 T 2 P3
C 0 Id 9 T 3 P4
C 0 Id 10 T 3 P0
C 0 Id 11 T 3 P1
C 0 Id 12 T 4 P2
C 0 Id 13 T 4 P3
C 0 Id 14 T 4 P4
C 0 Id 15 T 5 P0
C 0 Id 16 T 5 P1
C 0 Id 17 T 5 P2
C 0 Id 18 T 6 P3
C 0 Id 19 T 6 P4
Top Shifts:
C 0 Id 0 T 0 P0
C 0 Id 5 T 1 P0
C 0 Id 6 T 2 P1
C 0 Id 10 T 3 P0
C 0 Id 12 T 4 P2
C 0 Id 15 T 5 P0
C 0 Id 20 T 6 P0
C 0 Id 21 T 7 P1
C 0 Id 25 T 8 P0
C 0 Id 27 T 9 P2
Total Comparisons: 10,000,000,015.00
Any key to continue
Questions:
- How can I partition the query (while still executing as a single LinQ query) in order to get the comparisons down from 10 billion to 10 million?
- Is there a more efficient way of solving the problem instead of a sub-query?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
似乎
得到了相同的输出,但
与 @Geoff 的编辑进行了 100015 次比较,他只是将我的比较减少了一半:-)
How about
Seems to get the same output but 100015 comparisons
with @Geoff's edit he just halved my comparisons :-)
您是否尝试过使用
group by
:Have you tried using
group by
:老实说,我有点不确定你想要什么,但通过阅读你的代码,我想说你可以做类似的事情
I'm a bit unsure about what you want to be honest but from Reading your code I'd say you could do something like