高效的笛卡尔积算法
有人可以为我演示一种比我当前使用的算法更有效的笛卡尔积算法吗(假设有一个)。我环顾四周并用谷歌搜索了一下,但看不到任何明显的东西,所以我可能会错过一些东西。
foreach (int i in is) {
foreach (int j in js) {
//Pair i and j
}
}
这是我在代码中所做的高度简化的版本。这两个整数是查找键,用于检索一个/多个对象,并且两次查找中的所有对象都会配对成新对象。
随着数据集规模的扩大,在更大、更复杂的系统中的这一小代码块成为主要的性能瓶颈。其中一些问题可能可以通过改进用于存储对象和所涉及的查找的数据结构来缓解,但我认为主要问题仍然是笛卡尔积本身的计算。
编辑
因此,我有更多关于我对算法的具体使用的背景知识,看看是否有任何技巧可以用来回应 Marc 的评论。整个系统是一个 SPARQL 查询引擎,它处理对图数据集的 SPARQL 查询,SPARQL 是一种基于模式的语言,因此每个查询都包含一系列与图匹配的模式。在两个后续模式没有公共变量(它们不相交)的情况下,有必要计算两个模式产生的解决方案的笛卡尔积,以获得整个查询的可能解决方案集。可能有任意数量的模式,并且我可能必须多次计算笛卡尔积,如果查询由一系列不相交的模式组成,这可能会导致可能的解决方案呈指数级扩展。
不知何故,从现有的答案来看,我怀疑是否有任何技巧可以应用
更新
所以我想我会发布我实现的更新,以最大限度地减少笛卡尔积的需要,从而优化查询引擎一般来说。请注意,并不总是可以完全消除对产品的需求,但几乎总是可以进行优化,以便连接的两个集合的大小要小得多。
由于每个 BGP(基本图形模式)是一组三重模式,都作为一个块(本质上)执行,因此引擎可以自由地对 BGP 内的模式进行重新排序,以获得最佳性能。例如,考虑以下 BGP:
?a :someProperty ?b .
?c :anotherProperty ?d .
?b a :Class .
按原样执行,查询需要笛卡尔积,因为第一个模式的结果与第二个模式不相交,因此前两个模式的结果是其各自结果的笛卡尔积。这个结果将包含比我们实际需要的更多的结果,因为第三个模式限制了第一个模式的可能结果,但我们直到之后才应用此限制。但是,如果我们像这样重新排序:
?b a :Class .
?a :someProperty ?b .
?c :anotherProperty ?d .
我们仍然需要笛卡尔积来获得最终结果,因为第二个和第三个模式仍然不相交,但通过重新排序,我们限制了第二个模式的结果的大小,这意味着我们的笛卡尔积的大小会小很多。
我们还进行了一些其他优化,但我不打算将它们发布在这里,因为它开始对 SPARQL 引擎内部结构进行相当详细的讨论。如果有人对更多细节感兴趣,请发表评论或向我发送推文@RobVesse
Can somebody please demonstrate for me a more efficient Cartesian product algorithm than the one I am using currently (assuming there is one). I've looked around SO and googled a bit but can't see anything obvious so I could be missing something.
foreach (int i in is) {
foreach (int j in js) {
//Pair i and j
}
}
This is a highly simplified version of what I do in my code. The two integers are lookup keys which are used to retrieve one/more objects and all the objects from the two lookups are paired together into new objects.
This small block of code in a much larger more complex system becomes a major performance bottleneck as the dataset it's operating over scales. Some of this could likely be mitigated by improving the data structures used to store the objects and the lookups involved but the main issue I feel is still the computation of the Cartesian product itself.
Edit
So some more background on my specific usage of the algorithm to see if there may be any tricks that I can use in response to Marc's comment. The overall system is a SPARQL query engine which processes SPARQL queries over sets of Graph data, SPARQL is a pattern based language so each query consists of a series of patterns which are matched against the Graph(s). In the case where two subsequent patterns have no common variables (they are disjoint) it is necessary to compute the Cartesian product of the solutions produced by the two patterns to get the set of possible solutions for the overall query. There may be any number of patterns and I may have to compute Cartesian products multiple times which can lead to a fairly exponential expansion in possible solutions if the query is composed of a series of disjoint patterns.
Somehow from the existing answers I doubt whether there any tricks that could apply
Update
So I thought I would post an update on what I implemented in order to minimise the need to do Cartesian products and thus optimise the query engine generally. Note that it's not always possible to completely eliminate the need for products but it's nearly always possible to optimise so the size of the two sets being joined is much smaller.
Since each BGP (Basic Graph Pattern) which is a set of Triple Patterns is executed as a block (in essence) the engine is free to reorder patterns within a BGP for optimal performance. For example consider the following BGP:
?a :someProperty ?b .
?c :anotherProperty ?d .
?b a :Class .
Executed as is the query requires a cartesian product since the results of the first pattern are disjoint from the second pattern so the results of the first two patterns is a cartesian product of their individual results. This result will contain far more results than we actually need since the third pattern restricts the possible results of the first pattern but we don't apply this restriction till afterwards. But if we reorder like so:
?b a :Class .
?a :someProperty ?b .
?c :anotherProperty ?d .
We'll still need a cartesian product to get the final results since the 2nd and 3rd patterns are still disjoint but by reordering we restrict the size of the results of the 2nd pattern meaning the size of our cartesian product will be much smaller.
There are some various other optimisations we make but I'm not going to post them here as it starts to get into fairly detailed discussion of SPARQL engine internals. If anyone is interested in further details just leave a comment or send me a tweet @RobVesse
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
笛卡尔积的复杂度是O(n2),没有捷径。
在特定情况下,迭代两个轴的顺序很重要。例如,如果您的代码正在访问数组中的每个槽 - 或图像中的每个像素 - 那么您应该尝试按自然顺序访问槽。图像通常布置在“扫描线”中,因此任何 Y 上的像素都是相邻的。因此,您应该迭代外循环上的 Y 和内循环上的 X。
您是否需要笛卡尔积或是否需要更有效的算法取决于您要解决的问题。
The complexity of cartesian product is O(n2), there is no shortcut.
In specific cases, the order you iterate your two axis is important. For example, if your code is visiting every slot in an array — or every pixel in an in image — then you should try to visit the slots in natural order. An image is typically laid out in ‘scanlines’, so the pixels on any Y are adjacent. Therefore, you should iterate over the Y on the outer loop and the X on the inner.
Whether you need the cartesian product or wherther is a more efficient algorithm depends on the problem you are solving.
如果没有一些额外的知识,您无法真正改变嵌套循环的性能,但这将是特定于用途的。如果您在
is
中有n
个项目,在js
中有m
个项目,那么它总是 O( n*m)。不过,您可以更改它的外观:
这仍然是 O(n*m),但名义上有 LINQ 的额外开销(在正常使用中您不会注意到它) , 尽管)。
您在您的案例中做什么?可能有技巧......
绝对要避免的一件事是强制交叉连接进行缓冲的任何事情 -
foreach
方法很好并且不会缓冲 - 但如果将每个项目添加到List
中,然后注意内存影响。同上OrderBy
等(如果使用不当)。You can't really change the performance of a nested loop without some additional knowledge, but that would be use-specific. If you have got
n
items inis
andm
items injs
, it is always going to be O(n*m).You can change the look of it though:
This is still O(n*m), but has nominal extra overhead of LINQ (you won't notice it in normal usage, though).
What are you doing in your case? There may be tricks...
One thing to definitely avoid is anything that forces a cross-join to buffer - the
foreach
approach is fine and doesn't buffer - but if you add each item to aList<>
, then beware the memory implications. DittoOrderBy
etc (if used inappropriately).我无法提出比 O(n^2) 更好的方案,因为这是输出的大小,因此该算法不可能更快。
我可以建议的是使用另一种方法来确定是否需要计算乘积。例如,如果您只是要查询某些元素是否属于产品集
P
,那么您甚至可能不需要该产品集。您只需要有关其组成的集合的信息。事实上(伪代码)
,笛卡尔积的计算方式如下:
如果您将集合实现为散列,那么在
m
集合的笛卡尔积中查找只是在m
散列中查找免费获得。构建笛卡尔积和IsInSet
查找都需要O(m)
时间,其中m
是相乘的集合数< /strong>,并且它远小于n
——每组的大小。I can't propose anything better, than O(n^2), because that's the size of the output, and the algorithm hence can't be any faster.
What I can suggest is using another approach to whether you need to compute product. For example, you may not even need the product set
P
if only you are going to query whether certain elements belong to it. You only need the information about the sets it's composed of.Indeed (pseudocode)
where Cartesian product is calculated like this:
If you implement sets as hashes, then lookup in a cartesian product of
m
sets is just a lookup inm
hashes you get for free. Construction of the cartesian product andIsInSet
lookup each takeO(m)
time, wherem
is a number of sets you multiply, and it's much less thann
--size of each set.附加信息已添加到问题中。
如果您记录已经计算过的内容,以避免再次重复它们,则可以避免重复项 - 假设这种簿记的成本 - 哈希图或简单列表 - 小于计算重复项的成本。
C# 运行时确实非常快,但对于极其繁重的工作,您可能需要使用本机代码。
您可能还会注意到这个问题的本质相似性。如果一个产品的计算不影响任何其他产品的计算,您可以直接使用多个处理器核心来并行完成工作。查看线程池。< a href="http://msdn.microsoft.com/en-us/library/system.threading.threadpool.queueuserworkitem(VS.80).aspx" rel="nofollow noreferrer">QueueUserWorkItem。
Additional information has been added to the question.
The duplicates can be avoided if you record those you've already computed so as to avoid duplicating them again - it's assumed that the cost of such bookkeeping - a hashmap or a simple list - is less than the cost of computing a duplicate.
The C# runtime is really very fast, but for extreme heavy lifting, you might want to drop into native code.
You might also notice the essential parallelness of this problem. If the computation of a product does not impact the computation of any other product, you could straightforwardly use a multiple processor cores for doing the work in parallel. Look at ThreadPool.QueueUserWorkItem.
如果缓存局部性(或维护 j 所需的本地内存)是一个问题,您可以通过递归地平分输入数组来使算法对缓存更加友好。类似于:
请注意,此访问模式将自动充分利用所有级别的自动缓存;这种技术称为cache-oblivious算法设计。
If cache locality (or local memory required to maintain the j's) is a problem, you can make your algorithm more cache-friendly by bisecting the input arrays recursively. Something like:
Note that this access pattern will automatically take fairly good advantage of all levels of automatic cache; this kind of technique is called cache-oblivious algorithm design.
我不知道如何用 C# 编写类似 Java 的迭代器,但也许你知道并且可以从这里转移我的解决方案 自己使用 C#。
如果您的组合太大而无法将它们完全保留在内存中,这可能会很有趣。
但是,如果您按属性对集合进行过滤,则应在构建组合之前进行过滤。示例:
如果您有从 1 到 1000 的数字和随机单词并将它们组合起来,然后过滤这些组合,其中数字可被 20 整除且单词以“d”开头,则可能有 1000*(26*x)= 26000*x 个组合可供搜索。
或者,您首先过滤数字,这会为您提供 50 个数字和(如果均匀分布)1 个字符,最终只有 50*x 个元素。
I don't know how to write Java-like-Iterators in C#, but maybe you know and can transfer my solution from here to C# yourself.
It can be interesting if your combinations are too big to keep them completely in memory.
However, if you filter by attribute over the collection, you should filter before building the combination. Example:
If you have numbers from 1 to 1000 and random words and combine them, and then filter those combinations, where the number is divisible by 20 and the word starts with 'd', you could have 1000*(26*x)=26000*x combinations to search.
Or you filter the numbers first, which gives you 50 numbers, and (if equally distributed) 1 character, which are only 50*x elements in the end.