I've got a lot of true/false results saved as bits in long[] arrays. I do have a huge number of these (millions and millions of longs).
For example, say I have only five results, I'd have:
+----- condition 5 is true
|
|+---- condition 4 is false
||
||+--- condition 3 is true
|||
|||+-- condition 2 is true
||||
||||+- condition 1 is false
10110
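For reference, reading a single condition out of such an array could be sketched as follows. The class and method names are made up for illustration, and the sketch assumes that condition 1 lives in the lowest bit of the first long, as in the diagram above.

```java
// Hypothetical helper, not part of the original post.
final class ConditionBits {
    // Assumption: condition n (1-based) is stored in bit (n - 1),
    // 64 conditions per long, lowest bit first.
    static boolean isTrue(long[] bits, int conditionNumber) {
        int index = conditionNumber - 1;
        return ((bits[index >>> 6] >>> (index & 63)) & 1L) != 0;
    }

    public static void main(String[] args) {
        long[] bits = { 0b10110L };   // the five results from the diagram
        for (int n = 1; n <= 5; n++) {
            System.out.println("condition " + n + " is " + isTrue(bits, n));
        }
    }
}
```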
I also have a few trees representing statements like:
condition1 AND (condition2 OR (condition3 AND condition4))
The trees are very simple but very long. They basically look like this (it's an oversimplification below, just to show what I've got):
class Node {
int operator();
List<Node> nodes;
int conditionNumber();
}
Basically, either the Node is a leaf and then has a condition number (matching one of the bits in the long[] arrays), or the Node is not a leaf and hence refers to several subnodes.
They're simple, yet they allow expressing complicated boolean expressions. It works great.
So far so good, everything is working great. However, I do have a problem: I need to evaluate a LOT of expressions, determining if they're true or false. Basically I need to do some brute-force computation for a problem for which there's no known better solution than brute-forcing.
So I need to walk the tree and answer either true or false depending on the content of the tree and the content of the long[].
The method I need to optimize looks like this:
boolean solve( Node node, long[] trueorfalse ) {
...
}
where on the first call the node is the root node and then, obviously, subnodes (being recursive, that solve method calls itself).
Knowing that I'll only have a few trees (maybe up to a hundred or so) but millions and millions of long[] to check, what steps can I take to optimize this?
The obvious recursive solution passes parameters (the (sub)tree and the long[]; I could get rid of the long[] by not passing it as a parameter) and is quite slow with all the recursive calls etc. I need to check which operator is used (AND or OR or NOT etc.) and there are quite a lot of if/else or switch statements involved.
I'm not looking for another algorithm (there isn't one), so I'm not looking to go from O(x) to O(y) where y would be smaller than x.
What I'm looking for is "times x" speedup: if I can write code performing 5x faster, then I'll have a 5x speedup and that's it and I'd be very happy with it.
The only enhancement I see as of now (and I think it would be a huge "times x" speedup compared to what I have now) would be to generate bytecode for every tree and have the logic for every tree hardcoded into a class. It should work well because I'll only ever have a hundred or so trees (but the trees aren't fixed: I cannot know in advance what the trees are going to look like, otherwise it would be trivial to simply hardcode every tree manually).
Any idea besides generating bytecode for every tree?
Now if I want to try the bytecode generation route, how should I go about it?
Comments (3)
In order to maximize the opportunities for shortcut evaluation, you need to do your own branch prediction.
You might want to profile it, tallying how often each branch of an AND or OR node ends up deciding the result on its own.
You can then reorder the tree relative to the weights that you found in the profiling step. If you want/need to be particularly nifty, you can devise a mechanism that detects the weighting for a certain dataset during runtime, so you can reorder the branches on the fly.
Note that in the latter case, it might be advisable to not reorder the actual tree (with respect to storage efficiency and correctness of result while still executing), but rather devise a tree-node visitor (traversal algorithm) that is able to locally sort the branches according to the 'live' weights.
I hope all of this makes sense, because I realize the prose version is dense. However, like Fermat said, the code example is too big to fit into this margin :)
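That said, a heavily reduced sketch of the idea might look like the following. Everything here is an assumption made for illustration: the node layout, the names, and the choice to keep the visiting order and the counters inside the node instead of in a separate visitor. The children array itself is never reordered; only the order in which they are visited changes.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch; layout and names are assumptions, not the asker's classes.
final class WeightedNode {
    static final int LEAF = 0, AND = 1, OR = 2;

    final int op;
    final int condition;            // 0-based bit index, used by leaves only
    final WeightedNode[] children;  // never reordered
    final Integer[] order;          // visiting order, as indexes into children
    final long[] decided;           // how often child i short-circuited this node

    private WeightedNode(int op, int condition, WeightedNode[] children) {
        this.op = op;
        this.condition = condition;
        this.children = children;
        this.order = new Integer[children.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        this.decided = new long[children.length];
    }

    static WeightedNode leaf(int condition) {
        return new WeightedNode(LEAF, condition, new WeightedNode[0]);
    }

    static WeightedNode of(int op, WeightedNode... children) {
        return new WeightedNode(op, -1, children);
    }

    boolean eval(long[] bits) {
        if (op == LEAF) {
            return ((bits[condition >>> 6] >>> (condition & 63)) & 1L) != 0;
        }
        boolean stopOn = (op == OR);  // OR stops on the first true, AND on the first false
        for (int i : order) {
            if (children[i].eval(bits) == stopOn) {
                decided[i]++;         // this child settled the result early
                return stopOn;
            }
        }
        return !stopOn;
    }

    // Re-sort the visiting order so that the children which most often
    // decide the result are tried first. Call this between batches.
    void reorder() {
        Arrays.sort(order, Comparator.comparingLong((Integer i) -> -decided[i]));
        for (WeightedNode child : children) child.reorder();
    }
}
```

Calling reorder() every so many evaluations keeps the order tracking the data without touching the tree structure, so results stay the same regardless of when it runs.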
There is a simple and fast way to evaluate boolean operations like this in C. Assuming you want to evaluate z=(x op y) you can do this:
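As an illustration, here is one way such a table lookup could look, written in Java to match the question instead of C. The class name, the constants and the table layout (index = op + x + (y << 1)) are assumptions for the sake of the sketch.

```java
// Hypothetical sketch of a table-driven boolean operation.
final class OpTable {
    // Each operation owns four consecutive table slots.
    static final int AND = 0, OR = 4, XOR = 8;

    // Slot order within an operation: (x=0,y=0), (x=1,y=0), (x=0,y=1), (x=1,y=1).
    static final byte[] TABLE = {
        0, 0, 0, 1,   // AND
        0, 1, 1, 1,   // OR
        0, 1, 1, 0,   // XOR
    };

    // x and y must be 0 or 1; op must be one of the constants above.
    static int eval(int op, int x, int y) {
        return TABLE[op + x + (y << 1)];
    }
}
```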
So op will be a multiple of 4 that selects your operation (AND, OR, XOR, etc.), and you create a lookup table for all possible answers. If this table is small enough, you can encode it into a single value and use a right shift and a mask to select the output bit:
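A possible Java rendering of the packed variant is sketched below. The constant 0x6E8 is simply the twelve truth-table bits for AND, OR and XOR packed into one int under the assumed index layout op + x + (y << 1); the names are illustrative.

```java
// Hypothetical sketch: the whole lookup table packed into one constant.
final class PackedOps {
    static final int AND = 0, OR = 4, XOR = 8;

    // Bit k holds the answer for index k = op + x + (y << 1):
    //   AND -> bit 3, OR -> bits 5..7, XOR -> bits 9..10.
    static final int PACKED = 0x6E8;

    // x and y must be 0 or 1.
    static int eval(int op, int x, int y) {
        return (PACKED >>> (op + x + (y << 1))) & 1;
    }
}
```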
That would be the fastest way to evaluate large numbers of these. Of course you'll have to split operations with multiple inputs into trees where each node has only 2 inputs. There is no easy way to short circuit this however. You can convert the tree into a list where each item contains the operation number and pointers to the 2 inputs and output. Once in list form, you can use a single loop to blow through that one line a million times very quickly.
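The list form could be sketched roughly like this in Java. The layout is an assumption: the first slots of a scratch array hold the condition bits, later slots hold intermediate results, and the items are ordered so that every input is computed before it is used. The scratch array is reused between calls, so one instance should not be shared between threads.

```java
// Hypothetical sketch of a tree flattened into a list of two-input operations.
final class FlatProgram {
    static final int AND = 0, OR = 4, XOR = 8;
    static final int PACKED = 0x6E8;  // truth tables, index = op + x + (y << 1)

    final int leaves;                 // number of condition bits copied into the slots
    final int[] op, a, b, out;        // one entry per operation: opcode, two inputs, output slot
    final int[] v;                    // scratch value slots, reused on every run

    FlatProgram(int leaves, int slots, int[] op, int[] a, int[] b, int[] out) {
        this.leaves = leaves;
        this.op = op;
        this.a = a;
        this.b = b;
        this.out = out;
        this.v = new int[slots];
    }

    boolean run(long[] bits) {
        for (int i = 0; i < leaves; i++) {
            v[i] = (int) ((bits[i >>> 6] >>> (i & 63)) & 1L);
        }
        // One straight loop, no recursion and no branching on the operator.
        for (int i = 0; i < op.length; i++) {
            v[out[i]] = (PACKED >>> (op[i] + v[a[i]] + (v[b[i]] << 1))) & 1;
        }
        return v[out[op.length - 1]] != 0;
    }
}
```

For condition1 AND (condition2 OR (condition3 AND condition4)), with the four conditions in slots 0 to 3, the list would be: AND(2, 3) into slot 4, OR(1, 4) into slot 5, AND(0, 5) into slot 6.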
For small trees, this is a win. For larger trees it's probably not, because this approach cannot short circuit: with short circuiting, the average number of branches that need to be evaluated per node goes from 2 to 1.5, which is a huge win for large trees. YMMV.
EDIT:
On second thought, you can use something like a skip-list to implement short circuiting. Each operation (node) would include a compare value and a skip-count. If the result matches the compare value, you can bypass the next skip-count values. So the list would be created from a depth-first traversal of the tree, and the first child would include a skip count equal to the size of the other child. This adds a bit more complexity to each node evaluation but allows short circuiting. Careful implementation could do it without any condition checking (think 1 or 0 times the skip-count).
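One possible reading of this, sketched in Java under a few assumptions: the tree is binary, only AND and OR appear, and the list mixes "load a condition" items with "test" items placed between the two children of a node. The skip itself is computed as skip-count times 0 or 1; the load still uses an ordinary if, so this version is not fully branch-free. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a flattened tree with skip counts for short circuiting.
final class SkipProgram {
    // Minimal binary tree used only to build the list; a leaf has left == null.
    static final class Node {
        final int cond;
        final boolean isAnd;
        final Node left, right;
        Node(int cond) { this.cond = cond; this.isAnd = false; this.left = null; this.right = null; }
        Node(boolean isAnd, Node left, Node right) { this.cond = -1; this.isAnd = isAnd; this.left = left; this.right = right; }
    }

    final int[] src, compare, skip;   // src >= 0: load that condition; src < 0: test only

    SkipProgram(Node root) {
        List<int[]> items = new ArrayList<>();
        emit(root, items);
        int n = items.size();
        src = new int[n];
        compare = new int[n];
        skip = new int[n];
        for (int i = 0; i < n; i++) {
            src[i] = items.get(i)[0];
            compare[i] = items.get(i)[1];
            skip[i] = items.get(i)[2];
        }
    }

    // Depth-first emission; returns the number of items written for this subtree.
    private static int emit(Node n, List<int[]> items) {
        if (n.left == null) {
            items.add(new int[] { n.cond, 0, 0 });
            return 1;
        }
        int leftSize = emit(n.left, items);
        int[] test = { -1, n.isAnd ? 0 : 1, 0 };  // AND gives up on 0, OR is satisfied on 1
        items.add(test);
        int rightSize = emit(n.right, items);
        test[2] = rightSize;                      // skip the whole second child
        return leftSize + 1 + rightSize;
    }

    boolean run(long[] bits) {
        int r = 0;
        int pc = 0;
        while (pc < src.length) {
            int s = src[pc];
            if (s >= 0) r = (int) ((bits[s >>> 6] >>> (s & 63)) & 1L);
            // (r ^ compare ^ 1) is 1 when r == compare, else 0.
            pc += 1 + skip[pc] * (r ^ compare[pc] ^ 1);
        }
        return r != 0;
    }
}
```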
I think your byte-coding idea is the right direction.
What I would do, regardless of language, is write a precompiler.
It would walk each tree and use print statements to translate it into source code.
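A small sketch of such a generator in Java, assuming a simplified node type with string operators. The node class, the bit helper emitted into the generated source, and the generated class layout are all made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: turn an expression tree into Java source text.
final class SourceGen {
    static final class Node {
        final String op;      // "&&", "||", "!" or null for a leaf
        final int cond;       // 0-based condition index, leaves only
        final List<Node> kids;
        private Node(String op, int cond, List<Node> kids) { this.op = op; this.cond = cond; this.kids = kids; }
        static Node leaf(int cond) { return new Node(null, cond, Collections.<Node>emptyList()); }
        static Node of(String op, Node... kids) { return new Node(op, -1, Arrays.asList(kids)); }
    }

    // Emits one boolean expression; && and || in the output short-circuit for free.
    static String expr(Node n) {
        if (n.op == null) return "bit(b, " + n.cond + ")";
        if (n.op.equals("!")) return "!" + expr(n.kids.get(0));
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < n.kids.size(); i++) {
            if (i > 0) sb.append(' ').append(n.op).append(' ');
            sb.append(expr(n.kids.get(i)));
        }
        return sb.append(')').toString();
    }

    // Wraps the expression in a compilable class with a static solve method.
    static String classSource(String name, Node root) {
        return "public final class " + name + " {\n"
             + "    static boolean bit(long[] b, int i) {\n"
             + "        return ((b[i >>> 6] >>> (i & 63)) & 1L) != 0;\n"
             + "    }\n"
             + "    public static boolean solve(long[] b) {\n"
             + "        return " + expr(root) + ";\n"
             + "    }\n"
             + "}\n";
    }
}
```

For condition1 AND (condition2 OR (condition3 AND condition4)), expr would produce (bit(b, 0) && (bit(b, 1) || (bit(b, 2) && bit(b, 3)))).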
That can be compiled on the fly whenever the trees change, and the resulting byte code / dll loaded, all of which takes under a second.
The thing is, at present you are interpreting the contents of the trees.
Turning them into compiled code should make them run 10-100 times faster.
ADDED in response to your comments on not having the JDK. Then, if you can't generate Java byte code, I would try to write my own byte-code interpreter that would run as fast as possible. It could look something like this:
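A minimal Java sketch of such an interpreter, assuming a stack-based postfix encoding. The opcode names, the code layout (PUSH followed by a condition index) and the fixed stack depth are assumptions.

```java
// Hypothetical sketch of a tiny stack-based interpreter for boolean programs.
final class TinyVm {
    // Small, dense opcode values make a jump table the natural compilation of the switch.
    static final int PUSH = 0, AND = 1, OR = 2, NOT = 3;

    // code layout: PUSH is followed by a 0-based condition index; other opcodes take no operand.
    static boolean run(int[] code, long[] bits) {
        boolean[] stack = new boolean[64];  // assumption: expressions never nest deeper than this
        int sp = 0;
        for (int pc = 0; pc < code.length; pc++) {
            switch (code[pc]) {
                case PUSH: {
                    int c = code[++pc];
                    stack[sp++] = ((bits[c >>> 6] >>> (c & 63)) & 1L) != 0;
                    break;
                }
                case AND:
                    sp--;
                    stack[sp - 1] &= stack[sp];
                    break;
                case OR:
                    sp--;
                    stack[sp - 1] |= stack[sp];
                    break;
                case NOT:
                    stack[sp - 1] = !stack[sp - 1];
                    break;
                default:
                    throw new IllegalArgumentException("bad opcode " + code[pc]);
            }
        }
        return stack[0];
    }
}
```

In a real hot loop the stack would be allocated once and reused instead of being created on every call.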
The idea is to cause the compiler to turn the switch into a jump table, so it performs each operation with the smallest number of cycles.
To generate the opcodes, you can just do a postfix walk of the tree.
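The postfix walk could be sketched like this, again with an assumed node type and assumed opcode values (PUSH followed by a condition index, then AND, OR and NOT without operands). An AND or OR with more than two children is emitted as a chain of two-input operations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: compile an expression tree into postfix opcodes.
final class PostfixCompiler {
    static final int PUSH = 0, AND = 1, OR = 2, NOT = 3;

    static final class Node {
        final int op;         // PUSH marks a leaf
        final int cond;       // 0-based condition index, leaves only
        final List<Node> kids;
        private Node(int op, int cond, List<Node> kids) { this.op = op; this.cond = cond; this.kids = kids; }
        static Node leaf(int cond) { return new Node(PUSH, cond, Collections.<Node>emptyList()); }
        static Node of(int op, Node... kids) { return new Node(op, -1, Arrays.asList(kids)); }
    }

    static int[] compile(Node root) {
        List<Integer> out = new ArrayList<>();
        emit(root, out);
        int[] code = new int[out.size()];
        for (int i = 0; i < code.length; i++) code[i] = out.get(i);
        return code;
    }

    // Children first, operator afterwards.
    private static void emit(Node n, List<Integer> out) {
        if (n.op == PUSH) {
            out.add(PUSH);
            out.add(n.cond);
            return;
        }
        for (int i = 0; i < n.kids.size(); i++) {
            emit(n.kids.get(i), out);
            if (i > 0) out.add(n.op);   // combine with everything emitted so far
        }
        if (n.op == NOT) out.add(NOT);  // NOT has a single child, so the loop added no operator
    }
}
```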
On top of that, you might be able to simplify it by some manipulation of De Morgan's laws, so you could check more than one bit at a time.
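To illustrate checking several bits at once: when all the conditions of a flat AND or OR sit in the same long, a single mask comparison replaces the whole group. The helper names below are hypothetical, and the sketch assumes 0-based condition indexes below 64.

```java
// Hypothetical sketch: test a whole group of conditions with one mask operation.
final class MaskChecks {
    // Builds a mask with one bit per listed condition (0-based, all below 64).
    static long maskOf(int... conditions) {
        long m = 0;
        for (int c : conditions) m |= 1L << c;
        return m;
    }

    // c1 AND c2 AND ... : every masked bit must be set.
    static boolean allSet(long word, long mask) {
        return (word & mask) == mask;
    }

    // c1 OR c2 OR ... : at least one masked bit must be set.
    static boolean anySet(long word, long mask) {
        return (word & mask) != 0;
    }

    // De Morgan: (NOT c1) AND (NOT c2) AND ... equals NOT (c1 OR c2 OR ...).
    static boolean noneSet(long word, long mask) {
        return (word & mask) == 0;
    }
}
```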