Efficient way to get the first value in a list greater than x?

Posted 2024-10-17 15:54:36

I have two sorted arrays, Haystack and Needles. I need to iterate over Needles, and each time find the first point in Haystack with a value larger than that Needle, in order to perform the next step.

For example:

double dHaystack[] = { 1.2, 2.6, 7.0, 9.3, 19.4 };
double dNeedles[]  = { 1.4, 6.4, 6.5, 7.0, 10.3 };

//  expected indices    0    1    1    2    3

So the index I should get is that of the last value equal to or lower than the needle value (i.e. one before the first value greater than it).

The obvious approach is just to iterate from the beginning of the haystack for each needle, or to iterate onward from the last found index (as Needles is also sorted).

But part of my brain is shouting "bisection!". Would a bisection actually be faster here, since the compiler will find it harder to optimise than a simple block read and iteration? Would it need an incredibly long Haystack to be worthwhile?
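
For concreteness, here is a minimal sketch of that obvious linear approach (the function name findIndices and the out parameter are illustrative, not from the question): because both arrays are sorted, a single haystack cursor never moves backwards, so all the needles together cost O(M + N).

#include <cstddef>

// For each needle, advance one haystack cursor while the value is still
// <= the needle; the index just before the cursor is the answer.
void findIndices(const double* haystack, std::size_t n,
                 const double* needles, std::size_t m,
                 std::size_t* out) {
  std::size_t i = 0;
  for (std::size_t j = 0; j < m; ++j) {
    while (i < n && haystack[i] <= needles[j]) ++i;
    out[j] = (i > 0) ? i - 1 : 0;  // clamped if the needle precedes everything
  }
}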

4 Answers

落墨 2024-10-24 15:54:36

You need to consider when

n*lg(m) < n+m,

where n is the size of Needles and m is the size of Haystack: binary-searching the haystack once per needle costs about n*lg(m) comparisons, while a single merged linear pass costs about n+m. For example, with n = 100 and m = 1,000,000, n*lg(m) is roughly 2,000 against about 1,000,100 for the linear pass, so bisection wins easily; with n close to m, the linear pass is cheaper.

Therefore, it all depends on the combination of the values of n and m.

傾旎 2024-10-24 15:54:36

Use std::upper_bound, which is O(log n) for random access iterators and provides exactly what you need in the shortest and simplest code.

Before you worry about minute performance, test your current code (and maybe test alternatives) instead of making assumptions. In particular, note you can start searching (the first parameter to upper_bound) from the last found index on each iteration.

#include <algorithm>  // std::upper_bound
#include <iostream>

// Available in Boost, C++0x, and many other places.  Implementation copied
// here for the sake of the example.
template<class T, int N>
T* end(T (&a)[N]) {
  return a + N;
}

void example() {
  double haystack[] = {1.2, 2.6, 7.0, 9.3, 19.4};
  double needles[] = {1.4, 6.4, 6.5, 7.0, 10.3};
  double *begin = haystack;
  for (double *n = needles; n != end(needles); ++n) {
    // First haystack element strictly greater than *n; resume from the
    // previous hit because needles is sorted too.
    double *found = std::upper_bound(begin, end(haystack), *n);
    if (found == end(haystack)) break;
    std::cout << *n << " at index " << (found - haystack) << '\n';
    begin = found;
  }
}
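
For the sample arrays this prints indices 1, 2, 2, 3, 4: the position of the first element strictly greater than each needle. The indices in the question (0, 1, 1, 2, 3) are one less than that, so subtract one from found - haystack if you want the last element less than or equal to the needle.
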
旧街凉风 2024-10-24 15:54:36

The obvious approach is just to iterate... onward from the last found index (as Needles is also sorted).

Yes.

But part of my brain is shouting "bisection!". Would a bisection actually be faster here, since the compiler will find it harder to optimise than a simple block read and iteration? Would it need an incredibly long Haystack to be worthwhile?

I don't think compiler optimisation is the issue (it just removes unnecessary work) so much as the amount of inherent, necessary work. If both sets are similar in size, I'd stick with the obvious approach. If the haystack is massively larger than the needle set, then bisection or even interpolation might yield slightly better performance. Unless this is crucial to your app you're unlikely to notice the difference, and if it is you should benchmark, particularly as you can presumably get a working implementation quickly using std::set and upper_bound or lower_bound (I can never remember which I'll need - I don't use them often enough), maybe using the last position as a hint at the starting location if your library supports that.
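
For reference (not from the answer), a minimal illustration of the difference between the two: std::lower_bound returns the first element not less than the value, std::upper_bound the first element strictly greater.

#include <algorithm>
#include <iostream>

int main() {
  double haystack[] = {1.2, 2.6, 7.0, 9.3, 19.4};
  double *first = haystack, *last = haystack + 5;
  // lower_bound: first element >= 7.0 -> index 2 (the 7.0 itself)
  std::cout << (std::lower_bound(first, last, 7.0) - haystack) << '\n';
  // upper_bound: first element >  7.0 -> index 3 (9.3)
  std::cout << (std::upper_bound(first, last, 7.0) - haystack) << '\n';
}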

緦唸λ蓇 2024-10-24 15:54:36

std::upper_bound will give you the iterator to the first element strictly greater, or the "end" of the collection if no element qualifies.

upper_bound takes iterators for begin and end, end being one past the end of the collection. If you are iterating through an increasing list of search values, you would of course not need to run through the entire collection but your "begin" can shift further to the right.

Of course with a haystack of just 5 elements it doesn't really matter what search algorithm you use, but if it got very large, a linear search could be very slow, particularly if there were very few needles.

This is a situation where both sizes really matter. For example, if your search space N is large but the number of items being searched for (M) is small, then O(M log N) really is a lot smaller: with M = 20 and N = 16K, log N is 14, so M log N is 280, compared with O(M + N), which in this case is about 16K. If M is approximately the same size as N, then O(M log N) is effectively a lot worse than O(N).

Therefore, depending on the sizes of your collections, you can choose which algorithm to use.
