这个数组比较问题的最佳算法是什么?
解决以下问题最有效的速度算法是什么?
给定 6 个数组,D1、D2、D3、D4、D5 和 D6,每个数组包含 6 个数字,例如:
D1[0] = number D2[0] = number ...... D6[0] = number
D1[1] = another number D2[1] = another number ....
..... .... ...... ....
D1[5] = yet another number .... ...... ....
给定第二个数组 ST1,包含 1 个数字:
ST1[0] = 6
给定第三个数组 ans,包含 6 个数字:
ans[0] = 3, ans[1] = 4, ans[2] = 5, ......ans[5] = 8
用作数组 D1、D2 的索引,D3,D4,D5 和 D6,从 0 到 ST1[0] 中存储的数字减一的数字,在本例中为 6,因此从 0 到 6-1,将 ans 数组与每个 D 数组进行比较。如果在同一索引处的任何 D 中未找到一个或多个 ans 编号,则结果应为 0;如果在同一索引处的某个 D 中找到所有 ans 编号,则结果应为 1。也就是说,如果某些 ans[i] 不等于任何 DN[i],则返回 0;如果每个 ans[i] 等于某个 DN[i],则返回 1 ]。
到目前为止我的算法是:
我试图尽可能地让所有东西都解开。
EML := ST1[0] //number contained in ST1[0]
EML1 := 0 //start index for the arrays D
While EML1 < EML
if D1[ELM1] = ans[0]
goto two
if D2[ELM1] = ans[0]
goto two
if D3[ELM1] = ans[0]
goto two
if D4[ELM1] = ans[0]
goto two
if D5[ELM1] = ans[0]
goto two
if D6[ELM1] = ans[0]
goto two
ELM1 = ELM1 + 1
return 0 //If the ans[0] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers
two:
EML1 := 0 start index for arrays Ds
While EML1 < EML
if D1[ELM1] = ans[1]
goto three
if D2[ELM1] = ans[1]
goto three
if D3[ELM1] = ans[1]
goto three
if D4[ELM1] = ans[1]
goto three
if D5[ELM1] = ans[1]
goto three
if D6[ELM1] = ans[1]
goto three
ELM1 = ELM1 + 1
return 0 //If the ans[1] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers
three:
EML1 := 0 start index for arrays Ds
While EML1 < EML
if D1[ELM1] = ans[2]
goto four
if D2[ELM1] = ans[2]
goto four
if D3[ELM1] = ans[2]
goto four
if D4[ELM1] = ans[2]
goto four
if D5[ELM1] = ans[2]
goto four
if D6[ELM1] = ans[2]
goto four
ELM1 = ELM1 + 1
return 0 //If the ans[2] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers
four:
EML1 := 0 start index for arrays Ds
While EML1 < EML
if D1[ELM1] = ans[3]
goto five
if D2[ELM1] = ans[3]
goto five
if D3[ELM1] = ans[3]
goto five
if D4[ELM1] = ans[3]
goto five
if D5[ELM1] = ans[3]
goto five
if D6[ELM1] = ans[3]
goto five
ELM1 = ELM1 + 1
return 0 //If the ans[3] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers
five:
EML1 := 0 start index for arrays Ds
While EML1 < EML
if D1[ELM1] = ans[4]
goto six
if D2[ELM1] = ans[4]
goto six
if D3[ELM1] = ans[4]
goto six
if D4[ELM1] = ans[4]
goto six
if D5[ELM1] = ans[4]
goto six
if D6[ELM1] = ans[4]
goto six
ELM1 = ELM1 + 1
return 0 //If the ans[4] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers
six:
EML1 := 0 start index for arrays Ds
While EML1 < EML
if D1[ELM1] = ans[5]
return 1 ////If the ans[1] number is not found in either D1[0-6].....
if D2[ELM1] = ans[5] return 1 which will then include ans[0-6] numbers
return 1
if D3[ELM1] = ans[5]
return 1
if D4[ELM1] = ans[5]
return 1
if D5[ELM1] = ans[5]
return 1
if D6[ELM1] = ans[5]
return 1
ELM1 = ELM1 + 1
return 0
作为选择的语言,它将是纯c
What is the most efficient for speed algorithm to solve the following problem?
Given 6 arrays, D1,D2,D3,D4,D5 and D6 each containing 6 numbers like:
D1[0] = number D2[0] = number ...... D6[0] = number
D1[1] = another number D2[1] = another number ....
..... .... ...... ....
D1[5] = yet another number .... ...... ....
Given a second array ST1, containing 1 number:
ST1[0] = 6
Given a third array ans, containing 6 numbers:
ans[0] = 3, ans[1] = 4, ans[2] = 5, ......ans[5] = 8
Using as index for the arrays D1,D2,D3,D4,D5 and D6, the number that goes from 0, to the number stored in ST1[0] minus one, in this example 6, so from 0 to 6-1, compare the ans array against each D array. The result should be 0 if one or more ans numbers are not found in any D at the same index and should be 1 if all ans numbers are found in some D at the same index. That is, return 0 if some ans[i] doesn't equal any DN[i] and return 1 if every ans[i] equals some DN[i].
My algorithm so far is:
I tried to keep everything unlooped as much as possible.
EML := ST1[0] //number contained in ST1[0]
EML1 := 0 //start index for the arrays D
While EML1 < EML
if D1[ELM1] = ans[0]
goto two
if D2[ELM1] = ans[0]
goto two
if D3[ELM1] = ans[0]
goto two
if D4[ELM1] = ans[0]
goto two
if D5[ELM1] = ans[0]
goto two
if D6[ELM1] = ans[0]
goto two
ELM1 = ELM1 + 1
return 0 //If the ans[0] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers
two:
EML1 := 0 start index for arrays Ds
While EML1 < EML
if D1[ELM1] = ans[1]
goto three
if D2[ELM1] = ans[1]
goto three
if D3[ELM1] = ans[1]
goto three
if D4[ELM1] = ans[1]
goto three
if D5[ELM1] = ans[1]
goto three
if D6[ELM1] = ans[1]
goto three
ELM1 = ELM1 + 1
return 0 //If the ans[1] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers
three:
EML1 := 0 start index for arrays Ds
While EML1 < EML
if D1[ELM1] = ans[2]
goto four
if D2[ELM1] = ans[2]
goto four
if D3[ELM1] = ans[2]
goto four
if D4[ELM1] = ans[2]
goto four
if D5[ELM1] = ans[2]
goto four
if D6[ELM1] = ans[2]
goto four
ELM1 = ELM1 + 1
return 0 //If the ans[2] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers
four:
EML1 := 0 start index for arrays Ds
While EML1 < EML
if D1[ELM1] = ans[3]
goto five
if D2[ELM1] = ans[3]
goto five
if D3[ELM1] = ans[3]
goto five
if D4[ELM1] = ans[3]
goto five
if D5[ELM1] = ans[3]
goto five
if D6[ELM1] = ans[3]
goto five
ELM1 = ELM1 + 1
return 0 //If the ans[3] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers
five:
EML1 := 0 start index for arrays Ds
While EML1 < EML
if D1[ELM1] = ans[4]
goto six
if D2[ELM1] = ans[4]
goto six
if D3[ELM1] = ans[4]
goto six
if D4[ELM1] = ans[4]
goto six
if D5[ELM1] = ans[4]
goto six
if D6[ELM1] = ans[4]
goto six
ELM1 = ELM1 + 1
return 0 //If the ans[4] number is not found in either D1[0-6], D2[0-6].... D6[0-6] return 0 which will then exclude ans[0-6] numbers
six:
EML1 := 0 start index for arrays Ds
While EML1 < EML
if D1[ELM1] = ans[5]
return 1 ////If the ans[1] number is not found in either D1[0-6].....
if D2[ELM1] = ans[5] return 1 which will then include ans[0-6] numbers
return 1
if D3[ELM1] = ans[5]
return 1
if D4[ELM1] = ans[5]
return 1
if D5[ELM1] = ans[5]
return 1
if D6[ELM1] = ans[5]
return 1
ELM1 = ELM1 + 1
return 0
As language of choice, it would be pure c
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
我对原始发布者提供的算法进行了直接的简单 C 实现。它位于此处
正如其他人建议的那样,要做的第一件事就是汇总代码。展开对于速度来说并不是很好,因为它会导致代码缓存未命中。我首先滚动内部循环并得到这个。然后我滚动了外循环并删除了现在无用的 goto 并得到了下面的代码。
编辑:我多次更改了 C 代码,因为即使很简单,在使用 CUDA (CUDA 似乎对错误不是很详细)。这就是为什么下面的代码使用全局变量......这只是一个简单的实现。我们还没有追求速度。它充分说明了过早的优化。如果我们甚至无法让它发挥作用,为什么还要费力让它变得更快呢?我想仍然存在问题,因为如果我相信维基百科的文章,CUDA 似乎对您可以工作的代码施加了许多限制。也许我们应该使用 float 而不是 int ?
这很有趣,因为我们可以理解代码在做什么。顺便说一下,在做这个打包工作时,我纠正了原来问题中的几个奇怪的地方。我认为这是拼写错误,因为它在全球背景下根本不合逻辑。
- goto 总是跳到两个(它应该已经进步了)
- 最后一次测试检查了 ans[0] 而不是 ans[5]
请标记,如果我在上述原始代码应该做什么的假设中是错误的,并且您的原始算法没有拼写错误,请纠正我。
代码对 ans 中的每个值执行的操作是检查它是否存在于二维数组中。如果有任何数字丢失,则返回 0。如果找到所有数字,则返回 1。
要获得真正的快速代码,我要做的不是用 C 实现它,而是用另一种语言(如 python(或 C++))实现,其中 set 是基本数据标准库提供的结构。然后,我将使用数组的所有值构建一个集合(即 O(n)),并检查搜索到的数字是否存在于集合中(即 O(1))。至少从算法的角度来看,最终的实现应该比现有代码更快。
下面是 Python 示例,因为它确实很简单(打印 true/false 而不是 1/0,但你明白了):
这是使用集合的可能的 C++ 实现:
我们做了一些性能假设:ans 的内容应该排序,否则我们应该构造它,否则,我们假设 D1..D6 的内容将在算法调用之间发生变化。因此,我们不必为它构造一个集合(因为集合构造是 O(n) ,无论如何,如果 D1..D6 发生变化,我们不会获得任何东西)。但是,如果我们使用相同的 D1..D6 多次调用算法,即 ans 发生变化,我们应该做相反的事情,并将 D1..D6 转换为我们保持可用的更大的集合。
如果我坚持 CI 可以这样做:
因为数据大小确实是这里很小,我们也可以尝试进行微观优化。在这里工资可能会更高。不确定。
EDIT2:CUDA 支持的 C 子集有硬性限制。最受限制的是我们不应该使用指向主内存的指针。必须考虑到这一点。它解释了为什么当前代码不起作用。最简单的更改可能是依次为每个数组 D1..D6 调用它。为了保持简短并避免函数调用成本,我们可以使用宏或内联函数。我将发布一个提案。
I did a direct trivial C implementation of the algorithm provided by the original poster. It is here
As other proposed the first thing to do is to roll up the code. Unrolling is not really good for speed as it lead to code cache misses. I began by rolling inner loops and got this. Then I rolled the outer loop and removed the now useless gotos and got code below.
EDIT: I changed several time the C code because even as simple as it is there seems to be problems when JIT compiling or executing it with CUDA (and CUDA seems not to be very verbose about errors). That is why the piece of code below use globals... and that is just trivial implementation. We are not yet going for speed. It says much about premature optimization. Why bother to make it fast if we can't even make it work ? I guess there is still issues as CUDA seems to impose many restrictions on the code you can make work if I believe the Wikipedia article. Also maybe we should use float instead of int ?
Now that is interesting, because we can understand what code is doing. By the way doing this packing job I corrected several oddities in the original question. I believe it was typos as it wasn't logical at all in the global context.
- goto always jump to two (it should have progressed)
- the last test checked ans[0] instead of ans[5]
please Mark, correct me if I'm wrong in the above asumptions on what the original code should do and your original algorithm is typo free.
What the code is doing it for each value in ans check it is present in a two dimentional array. If any number miss it returns 0. If all numbers are found it returns 1.
What I would do to get a real fast code is not to implement it in C but in another language like python (or C++) where set is a basic data structure provided by standard libraries. Then I would build a set with all the values of the arrays (that is O(n)) and check if numbers searched are present in set or not (that is O(1)). The final implementation should be faster than the existing code at least from an algorithmic point of view.
Python example is below as it is really trivial (print true/false instead of 1/0 but you get the idea):
Here is a possible C++ implementation using sets:
We make some performance hypothesis : the content of ans should be sorted or we should construct it otherwise, we suppose that content of D1..D6 will change between calls to algo. Hence we do not bother constructing a set for it (as set construction is O(n) anyway we wouldn't gain anything if D1..D6 are changing). But if we call several times algo with the same D1..D6 and that is ans that change we should do the opposite and transform D1..D6 into one larger set that we keep available.
If I stick to C I could do it as follow:
As data size are really small here , we could also try going for micro optimizations. It could pay better here. Don't know for sure.
EDIT2: there is hard restrictions on the subset of C supported by CUDA. The most restrictive one is that we shouldn't use pointers to main memory. That will have to be taken into account. It explains why the current code does not work. The simplest change is probably to call it for every array D1..D6 in turn. To keep it short and avoid function call cost we may use a macro or an inline function. I will post a proposal.
我对你的问题有点困惑,但我想我已经足够帮助你开始了。
接下来您可能应该使用嵌套的 for 循环。每个循环将对应于
D
的维度。请记住,索引的顺序很重要。在 C 中保持简洁的最简单方法是记住,即使D
具有多个维度(并且计算结果为指针),D[i]
也是一个有效的表达式。到一行:子数组)。如果您无法将独立的 D 数组更改为一个多维数组,您可以轻松创建一个指针数组,其成员指向每个数组的头并达到相同的效果。
然后,在确定当前的
D[i]
与ans
不匹配后,可以使用break 语句跳出内循环。I was a little bit confused by your question, but I think I have it enough to help you get started.
Next you should probably use nested for loops. Each loop will correspond to a dimension of
D
. Remember that the order of your indexes matters. The easiest way to keep it straight in C is to remember thatD[i]
is a valid expression even whenD
has more than one dimension (and would evaluate to a pointer to a row : a sub array).If you can't change the independent D arrays to one multidimensional array you could easily make a pointer array whose members point to the heads of each of those arrays and achieve the same effect.
Then you can use the break statement to break out of the inner loop after you have determined that the current
D[i]
doesn't match theans
.由于只有 36 个值需要比较,最有效的方法是根本不使用 CUDA。
只需使用 CPU 循环即可。
如果您更改输入,我将更改我的答案。
With only 36 values to compare, the most efficient approach would be to not use CUDA at all.
Just use a CPU loop.
If you change your inputs, I'll change my answer.
如果数字的范围有限,那么创建一个位数组可能会更容易,如下所示:
这将始终在 O(6 * ST1 + 6) 中运行。现在,这样做的缺点是首先要遍历最多 36 个数组,然后检查六个值。如果有一个强有力的先决条件,即数字将大部分出现,则可以反转测试并提供提前退出:
这最多将在 O(6 * ST1 + 6) 中运行,最多在 O(6 + 1 )如果第一个数组中的第一个数字覆盖了所有
ans
(并且ans
是相同数字的六倍)。请注意,位掩码为零的测试可以在每个数组之后(就像现在一样)或在每个元素之后(这种方式涉及更多检查,但也需要在找到所有数字时更早截止)。在 CUDA 环境中,算法的第一个版本可能会更快,因为它涉及更少的分支,并且大多数循环(除了 ST1 的循环)可以自动展开。然而,如果数字的范围是无限的,我们可以做别的事情。由于 ans 和所有数组中最多只有 7 * 6 = 42 个不同的数字,因此可以将它们映射到 42 个不同的数字并使用 64 位整数作为位掩码。但可以说,这种数字到整数的映射对于测试来说已经足够了,并且可以完全跳过这个测试。
另一种方法是对数组进行排序并简单地计算各个数字的覆盖范围:
复制的时间复杂度为 O(n),排序的时间复杂度为 O(36 log 36),唯一的时间复杂度为 O(n) (其中 n 为 6 * ST1)和用于搜索的 O(n log n)(其中,如果使用
unique
,则 n 可以小于 6 * ST1)。因此,整个算法以线性算数时间运行。请注意,这不涉及任何动态内存分配,因此甚至适用于 GPU 平台(必须实现排序和端口std::unique()
和std::lower_bound( )
,但所有这些都是非常简单的函数)。In case the range of the numbers is limited, it would be probably easier to make a bit array, like this:
This will always run in O(6 * ST1 + 6). Now this has the disadvantage of first going through up to 36 arrays and then checking against six values. If there is a strong precondition that the numbers will be mostly present, it is possible to reverse the test and provide an early exit:
This will run at most in O(6 * ST1 + 6), at best in O(6 + 1) if the first number in the first array covers all of
ans
(andans
is six times the same number). Note that the test for bit mask being zero can be either after each array (as it is now) or after each element (that way involves more checking but also earlier cutoff when all the numbers are found). In context of CUDA, the first version of the algorithm would likely be faster, as it involves fewer branches and most of the loops (except the one for ST1) can be automatically unrolled.However, if the range of the numbers is unlimited, we could do something else. Since there are only up to 7 * 6 = 42 different numbers in ans and all the arrays, it would be possible to map those to 42 different numbers and use a 64-bit integer for a bit mask. But arguably this mapping of numbers to integers would already be enough for the test and it would be possible to skip this test altogether.
Another way to do it would be to sort the arrays and simply count coverage of the individual numbers:
This will run in O(n) for copying, O(36 log 36) for sorting, optionally O(n) for
unique
(where n is 6 * ST1) and O(n log n) for searching (where n can be less than 6 * ST1 ifunique
is employed). The whole algorithm therefore runs in linearithmic time. Note that this does not involve any dynamic memory allocation and as such is suitable even for GPU platforms (one would have to implement sorting and portstd::unique()
andstd::lower_bound()
, but all those are farily simple functions).