C# 匹配内存数据中的固定属性
伙计们,这让我绞尽脑汁
,我在 Linq 中有一个缓慢的实现,我需要加快速度。
我需要尽可能最快的方法来比较内存(200,000)行中的大量只读数据。
每行都有一个名称和 10 个表示名称属性的整数。
Brake 8 7 3 2 1 0 4 3 2
Skull 8 7 3 2 1 0 4 3 2
Napkin 3 0 5 3 2 1 3 1 0
每个项目的任何单个属性都不会超过 8 个。所以每个值都可以很好地适合双
精度:
Brake:873,210,432
现在我的问题是我需要将这些值与每个字段的最大数量进行比较
,即:
500,000,000 将返回前 2 个,
000,001,000 将仅返回 Napkin,因为其他 2 个没有至少有 1 个处于第 4 位置。
我愿意接受任何尽可能快的解决方案。会有很多比较,所以速度是我在这里唯一担心的事情。
我当前的实现是这样的:
Items.Where(c => c.A <= 5).Where(c => c.B <= 0).Where(c => c.C <= 0)
.Where(c => c.D <= 0).Where(c => c.E <= 0).Where(c => c.F <= 0)
.Where(c => c.H <= 0).Where(c => c.I <= 0).Where(c => c.J <= 0)
或者(它们的运行速度大致相同)
Items.Where(c => c.A <= 5 && c.B <= 0 && c.C <= 0
&& c.D <= 0 && c.E <= 0 && c.F <= 0 && c.H <= 0
&& c.I <= 0 && c.J <= 0)
Guys this is racking my brain
I have a slow implementation in Linq that I need to speed up.
I need the fastest possible way to compare a large set of readonly data in memory (200,000) rows.
Each row has a name and 10 integers representing properties on the name
Brake 8 7 3 2 1 0 4 3 2
Skull 8 7 3 2 1 0 4 3 2
Napkin 3 0 5 3 2 1 3 1 0
Each item will never have more then 8 in any single property. So each value can fit nicely into a double
ex:
Brake:873,210,432
Now my problem is I need to compare these values to a maximum number for each field
ie:
500,000,000 would return the first 2 and
000,001,000 would only return Napkin because the other 2 don't have at least 1 in the 4th position.
I'm open to any soltuion that is as fast as possible. There will be a lot of comparisons so speed is the only thing I'm worried about here.
My current implementation would be something like this:
Items.Where(c => c.A <= 5).Where(c => c.B <= 0).Where(c => c.C <= 0)
.Where(c => c.D <= 0).Where(c => c.E <= 0).Where(c => c.F <= 0)
.Where(c => c.H <= 0).Where(c => c.I <= 0).Where(c => c.J <= 0)
Or (they both run about the same speed)
Items.Where(c => c.A <= 5 && c.B <= 0 && c.C <= 0
&& c.D <= 0 && c.E <= 0 && c.F <= 0 && c.H <= 0
&& c.I <= 0 && c.J <= 0)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
实际上,有一个比我原来的答案更好的方法,但它需要每个项目 5 位,而不是每个项目 4 位。我将用 16 位值中的三个项目来说明它。您可以使用长整数的前 50 位将其扩展到 10 个项目。
想象一下,您将三个 4 位计数器保存在一个 16 位整数中。每个计数器都有一个用于计算的附加位(见下文)。因此,带有前三个计数器的 Brake 项目将是:
现在,您需要与“504”匹配的内容,即第一列中至少有五个项目,第二列中至少有五个项目,第三列中至少有四个项目。该掩码将是:
从逻辑上讲,我们想要从 Brake 中减去搜索,并知道所有必需字段是否大于或等于零。这就是额外位的用武之地。如果我们在每个字段上设置该额外位,我们将得到:
现在,如果我们在这种情况下进行减法,我们最终会得到:
请注意,结果最左侧字段中的高位为 0。这表明制动值该字段中的数字小于我们要查找的值。
现在,如何在代码中实现此功能。在我的示例中,使用
short
,我会:您可以通过查看
rslt
中的位来确定哪些字段不匹配。每第五位(即位 4、9 和 14)应为 1。如果该位为 0,则该字段不满足最小值。所以要比较单个项目,你必须进行三个操作和一个比较。这将比我之前的回答快得多。
实现
下面的实现假设您在字节数组中拥有每个名称的 10 个值。第一个方法
CreateValue
构建一个代表这 10 个值的长整数。我假设您拥有某种格式的数据,您可以方便地将每个项目的字段值放入 10 个字节的数组中。您需要添加一个
CombinedValue
属性来保存组合值。或者也许您有一个并行数据结构来保存组合值。无论如何,您必须在程序启动时执行此循环一次,以创建将用于比较的数据(或者如果可以更新各个字段,则可能更新该值)。我还假设,当需要搜索时,您可以将要查找的值放入字节数组中,然后调用 CreateValue 来获取您要搜索的组合值。现在您所需要的只是进行比较的方法。
那么,搜索您的列表就变得非常简单:
Actually, there's a better way than my original answer, but it requires 5 bits per item rather than 4 bits per item. I'll illustrate it with three items in a 16-bit value. You can extend it to your 10 items using the first 50 bits of a long integer.
Imagine that you're saving three 4-bit counters in a 16-bit integer. Each counter has an additional bit that's used for calculation (see below). So your Brake item with the first three counters would be:
Now, you want things that match "504", that is at least five items in the first column, any number in the second column, and at least four in the third column. That mask would be:
Logically, we want to subtract the Search from the Brake, and know if all the required fields are greater than or equal to zero. That's where the extra bit comes in. If we set that extra bit on each of the fields, we have:
Now, if we do a subtraction in this case we end up with:
Note that the high bit in the leftmost field of the result is 0. This indicates that the number that was in that field of the Brake value was smaller than the value we were looking for.
Now, how to make this work in code. In my example, using
short
, I'd have:You can determine which fields don't match by looking at the bits in the
rslt
. Every fifth bit (i.e. bits 4, 9, and 14) should be 1. If the bit is 0, then that field didn't meet the minimum.So to compare a single item, you have to do three operations and a comparison. That's going to be a whole lot faster than my earlier answer.
Implementation
The implementation below assumes that you have the 10 values for each name in an array of bytes. The first method,
CreateValue
builds a long integer that represents those 10 values.I'll assume that you have your data in some format where you can conveniently put the field values for each item into an array of 10 bytes. You'll want to add a
CombinedValue
property that holds the combined values. Or perhaps you have a parallel data structure to hold the combined values. Anyway, you have to do this loop once at program startup to create the data that you'll use for comparison (or perhaps update the value if you can update the individual fields).I'll assume, too, that when it comes time to search, you can put the values you're looking for into an array of bytes and call
CreateValue
to get the combined value you're searching for. Now all you need is the method that will do the comparison.Searching your list, then, becomes very simple:
您的数据由类似于以下的结构表示:
无论是字节数组还是从中提取的位字段,我怀疑与性能相当无关。
一种方法是构建 8 个并行数组,其中包含对小部件的引用,每个此类数组都在不同的值列上排序。然后,查找包括对所需值的二分搜索。
另一种方法是读取源数据一次,并填充高度平衡的二叉搜索树数组,其键是所需列的值,其值是共享该特定列值的 Widget 列表。每列都需要一个这样的搜索树。显然,这种方法会消耗内存来提高速度——高度平衡二叉树中的查找是一个 O(log N) 操作。使用树意味着可以在集合中添加和删除项目,而无需施加大量开销。
虽然您可以编写自己的树实现,但我不愿意。这是一个使用 C5 Collections Library 的实现(无论如何,你的工具箱中应该有这个东西)因为我讨厌发明轮子:
Your data is represented by a structure that resembles this:
Whether it's an array of bytes, or a bit field that you draw from is, I suspect fairly irrelevant to performance.
One approach would be to build 8 parallel arrays containing references to your widgets with each such array being sorted on a different value column. Lookup then consisted of a binary search for the desired values.
Another approach is to read your source data once, and populate an array of height-balanced binary search trees, whose key is the value of the desired column and whose value is the list of Widgets sharing that particular column value. You'll need one such search tree for each column. Obviously, this approach eats memory to get speed — lookup in a height-balance binary tree being a O(log N) operation. Using trees means that items can be added to and removed from the collection without imposing a lot of overhead.
Whilst you could write your own tree implementation, I'd rather not. Here is an implementation, using the C5 Collections Library (something you should have in your toolbox anyway) since I hate having to invent the wheel: