Efficient way to get the first value in a list greater than x?

Posted 2024-10-17 15:54:36

I have two sorted arrays, Haystack and Needles. I need to iterate over Needles, and each time find the first point in Haystack with a value larger than that Needle, in order to perform the next step.

For example:

double dHaystack[] = { 1.2, 2.6, 7.0, 9.3, 19.4 };
double dNeedles[]  = { 1.4, 6.4, 6.5, 7.0, 10.3 };

//  expected indices    0    1    1    2    3

So the index I should get is that of the last value equal to or lower than the needle value (i.e. one before the first value greater than it).

The obvious approach is just to iterate from the beginning of the haystack for each needle, or to iterate onward from the last found index (as Needles is also sorted).

But part of my brain is shouting "bisection!". Would a bisection actually be faster here, since the compiler will find it harder to optimise than a simple block read and iteration? Would it need an incredibly long Haystack to be worthwhile?
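
For concreteness, here is a minimal sketch of that obvious linear approach (the function name findIndices and the out parameter are illustrative, not from the question): because both arrays are sorted, a single haystack cursor never moves backwards, so all the needles together cost O(M + N).

#include <cstddef>

// For each needle, advance one haystack cursor while the value is still
// <= the needle; the index just before the cursor is the answer.
void findIndices(const double* haystack, std::size_t n,
                 const double* needles, std::size_t m,
                 std::size_t* out) {
  std::size_t i = 0;
  for (std::size_t j = 0; j < m; ++j) {
    while (i < n && haystack[i] <= needles[j]) ++i;
    out[j] = (i > 0) ? i - 1 : 0;  // clamped if the needle precedes everything
  }
}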

4 Answers

落墨 2024-10-24 15:54:36

You need to consider when

n*lg(m) < n+m,

where n is the size of Needles and m is the size of Haystack: binary-searching the haystack once per needle costs about n*lg(m) comparisons, while a single merged linear pass costs about n+m. For example, with n = 100 and m = 1,000,000, n*lg(m) is roughly 2,000 against about 1,000,100 for the linear pass, so bisection wins easily; with n close to m, the linear pass is cheaper.

Therefore, it all depends on the combination of the values of n and m.

傾旎 2024-10-24 15:54:36

Use std::upper_bound, which is O(log n) for random access iterators and provides exactly what you need in the shortest and simplest code.

Before you worry about minute performance, test your current code (and maybe test alternatives) instead of making assumptions. In particular, note you can start searching (the first parameter to upper_bound) from the last found index on each iteration.

#include <algorithm>  // std::upper_bound
#include <iostream>

// Available in Boost, C++0x, and many other places.  Implementation copied
// here for the sake of the example.
template<class T, int N>
T* end(T (&a)[N]) {
  return a + N;
}

void example() {
  double haystack[] = {1.2, 2.6, 7.0, 9.3, 19.4};
  double needles[] = {1.4, 6.4, 6.5, 7.0, 10.3};
  double *begin = haystack;
  for (double *n = needles; n != end(needles); ++n) {
    // First haystack element strictly greater than *n; resume from the
    // previous hit because needles is sorted too.
    double *found = std::upper_bound(begin, end(haystack), *n);
    if (found == end(haystack)) break;
    std::cout << *n << " at index " << (found - haystack) << '\n';
    begin = found;
  }
}
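
For the sample arrays this prints indices 1, 2, 2, 3, 4: the position of the first element strictly greater than each needle. The indices in the question (0, 1, 1, 2, 3) are one less than that, so subtract one from found - haystack if you want the last element less than or equal to the needle.
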
旧街凉风 2024-10-24 15:54:36

The obvious approach is just to iterate... onward from the last found index (as Needles is also sorted).

Yes.

But part of my brain is shouting "bisection!". Would a bisection actually be faster here, since the compiler will find it harder to optimise than a simple block read and iteration? Would it need an incredibly long Haystack to be worthwhile?

I don't think compiler optimisation is the issue (it just removes unnecessary work) so much as the amount of inherent, necessary work. If both sets are similar in size, I'd stick with the obvious approach. If the haystack is massively larger than the needle set, then bisection or even interpolation might yield slightly better performance. Unless this is crucial to your app you're unlikely to notice the difference, and if it is you should benchmark, particularly as you can presumably get a working implementation quickly using std::set and upper_bound or lower_bound (I can never remember which I'll need - I don't use them often enough), maybe using the last position as a hint at the starting location if your library supports that.
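
For reference (not from the answer), a minimal illustration of the difference between the two: std::lower_bound returns the first element not less than the value, std::upper_bound the first element strictly greater.

#include <algorithm>
#include <iostream>

int main() {
  double haystack[] = {1.2, 2.6, 7.0, 9.3, 19.4};
  double *first = haystack, *last = haystack + 5;
  // lower_bound: first element >= 7.0 -> index 2 (the 7.0 itself)
  std::cout << (std::lower_bound(first, last, 7.0) - haystack) << '\n';
  // upper_bound: first element >  7.0 -> index 3 (9.3)
  std::cout << (std::upper_bound(first, last, 7.0) - haystack) << '\n';
}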

緦唸λ蓇 2024-10-24 15:54:36

std::upper_bound will give you the iterator to the first element strictly greater, or the "end" of the collection if no element qualifies.

upper_bound takes iterators for begin and end, end being one past the end of the collection. If you are iterating through an increasing list of search values, you would of course not need to run through the entire collection but your "begin" can shift further to the right.

Of course with a haystack of just 5 elements it doesn't really matter what search algorithm you use, but if it got very large, a linear search could be very slow, particularly if there were very few needles.

This is a situation where both sizes really matter. For example, if your search space N is large but the number of items being searched for (M) is small, then O(M log N) really is a lot smaller: with M = 20 and N = 16K, log N is 14, so M log N is 280, compared with O(M + N), which in this case is about 16K. If M is approximately the same size as N, then O(M log N) is effectively a lot worse than O(N).

Therefore, depending on the sizes of your collections, you can choose which algorithm to use.
