How to quickly get a sorted subvector out of a sorted vector

Posted on 2024-10-05 05:57:17

I have a data structure like this:

struct X {
  float value;
  int id;
};

and a vector of those (size N, think 100000), sorted by value, which stays constant during the execution of the program:

std::vector<X> values;

Now, I want to write a function

void subvector(std::vector<X> const& values, 
               std::vector<int> const& ids, 
               std::vector<X>& out /*, 
               helper data here */);

that quickly fills the out parameter with a sorted subset of values, given by the passed ids (size M < N, about 0.8 times N). Memory is not an issue, and this will be done repeatedly, so building lookup tables (the helper data from the function parameters) or anything else that is done only once is entirely OK.

My solution so far:
Build a lookup table lut mapping id -> offset in values (one-time preparation, so it does not count towards the per-call runtime)
Create std::vector<X> tmp of size N, filled with invalid ids (linear in N)
For each id, copy values[lut[id]] to tmp[lut[id]] (linear in M)
Loop over tmp, copying the valid items to out (linear in N)

This is linear in N (as N is bigger than M), but the temporary variable and the repeated copying bug me. Is there a way to do it quicker than this? Note that M will be close to N, so anything that is O(M log N) is unfavourable.
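The steps above can be sketched roughly as follows. This is not the code behind the ideone link, only a minimal illustration of the description; it assumes ids are unique and lie in [0, N), and the names build_lut and invalid are hypothetical.

```cpp
#include <cstddef>
#include <vector>

struct X {
    float value;
    int id;
};

// One-time preparation: lut[id] = offset of that id in `values`.
// Assumption: ids are unique and lie in [0, values.size()).
std::vector<std::size_t> build_lut(std::vector<X> const& values) {
    std::vector<std::size_t> lut(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        lut[static_cast<std::size_t>(values[i].id)] = i;
    }
    return lut;
}

void subvector(std::vector<X> const& values,
               std::vector<int> const& ids,
               std::vector<X>& out,
               std::vector<std::size_t> const& lut) {
    const int invalid = -1;  // hypothetical marker for "slot not selected"

    // Temporary of size N, every slot marked invalid (linear in N).
    std::vector<X> tmp(values.size(), X{0.0f, invalid});

    // Copy each requested element into its own sorted position (linear in M).
    for (int id : ids) {
        std::size_t pos = lut[static_cast<std::size_t>(id)];
        tmp[pos] = values[pos];
    }

    // Compact the valid slots into out, preserving sorted order (linear in N).
    out.clear();
    out.reserve(ids.size());
    for (X const& x : tmp) {
        if (x.id != invalid) {
            out.push_back(x);
        }
    }
}
```

The lookup table is built once and passed in as the helper data; each call then touches tmp twice, which is the copying the question would like to avoid.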

Edit: http://ideone.com/xR8Vp is a sample implementation of the algorithm described above, to make the desired output clear and to show that it is doable in linear time. The question is about whether the temporary variable can be avoided or the whole thing sped up in some other way; anything that is not linear will not be faster :).


Comments (3)

大海や 2024-10-12 05:57:17

An alternative approach you could try is to use a hash table instead of a vector to look up ids in:

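#include <unordered_set>
#include <vector>

// ids is passed as a hash set, so each membership test is O(1) expected.
void subvector(std::vector<X> const& values,
               std::unordered_set<int> const& ids,
               std::vector<X>& out) {

    out.clear();
    out.reserve(ids.size());
    // Walk values in sorted order and keep the elements whose id was requested;
    // out is therefore sorted without any extra sorting step.
    for(std::vector<X>::const_iterator i = values.begin(); i != values.end(); ++i) {
        if(ids.find(i->id) != ids.end()) {
            out.push_back(*i);
        }
    }
}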

This runs in time linear in N, since unordered_set::find takes constant expected time (assuming that we have no problems hashing ints). However, I suspect it might not be as fast in practice as the vector-based approach you described initially.
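If the ids happen to be small non-negative integers, the same single pass over values could use a plain flag vector instead of a hash set. The sketch below is a variation on the answer above, not something it proposes: subvector_flags and the wanted scratch buffer are hypothetical names, and the code assumes every id lies in [0, wanted.size()) and that wanted is all zeros on entry.

```cpp
#include <cstddef>
#include <vector>

struct X {
    float value;
    int id;
};

// `wanted` is caller-owned scratch space, allocated once and reused.
// Assumptions: wanted.size() > largest id, and wanted is all zeros on entry.
void subvector_flags(std::vector<X> const& values,
                     std::vector<int> const& ids,
                     std::vector<X>& out,
                     std::vector<char>& wanted) {
    // Mark the requested ids (linear in M).
    for (int id : ids) {
        wanted[static_cast<std::size_t>(id)] = 1;
    }

    // Single pass over the sorted values; each kept element is copied once.
    out.clear();
    out.reserve(ids.size());
    for (X const& x : values) {
        if (wanted[static_cast<std::size_t>(x.id)]) {
            out.push_back(x);
        }
    }

    // Clear only the flags that were set, so the buffer is ready for reuse.
    for (int id : ids) {
        wanted[static_cast<std::size_t>(id)] = 0;
    }
}
```

Compared with the tmp-based version in the question, this needs no temporary vector of X and no id -> offset table, at the cost of one byte of scratch space per possible id. Whether it is actually faster would have to be measured.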

徒留西风 2024-10-12 05:57:17

Since your vector is sorted, and you want a subset of it sorted the same way, I assume we can just slice out the chunk you want without rearranging it.

Why not just use find_if() twice? Once to find the start of the range you want, and once to find the end of the range. This will give you the start and end iterators of the subvector. Construct a new vector using those iterators; one of the vector constructor overloads takes two iterators.

That, or the partition algorithm, should work.
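A minimal sketch of the find_if() idea, under the assumption this answer makes: the wanted elements form one contiguous run of values. Here the run is described by a value interval [lo, hi], which is an illustrative choice; the question itself selects by arbitrary ids, so slice_by_value is a hypothetical helper and not a drop-in for subvector.

```cpp
#include <algorithm>
#include <vector>

struct X {
    float value;
    int id;
};

// Returns the elements whose value lies in [lo, hi].
// Assumption: `values` is sorted by value, so the result is one contiguous slice.
std::vector<X> slice_by_value(std::vector<X> const& values, float lo, float hi) {
    // First element that belongs to the range.
    auto first = std::find_if(values.begin(), values.end(),
                              [lo](X const& x) { return x.value >= lo; });
    // First element past the range, searching on from `first`.
    auto last = std::find_if(first, values.end(),
                             [hi](X const& x) { return x.value > hi; });
    // The iterator-pair constructor copies [first, last).
    return std::vector<X>(first, last);
}
```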

岁月染过的梦 2024-10-12 05:57:17

If I understood your problem correctly, you are actually trying to create a linear-time sorting algorithm (linear in the input size M).
That is NOT possible.

Your current approach relies on having a sorted list of all possible values.
It takes time linear in the number of possible values N (theoretically, given that the map lookup takes O(1) time).

The best you could do is to sort the values (the ones you found via the map) with a fast sorting method (O(M log M), e.g. quicksort, mergesort etc.) for small values of M, and maybe do the linear scan for bigger values of M.
For example, if N is 100000 and M is 100, it is much faster to just use a sorting algorithm.
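For the small-M case, the sort-based route might look like the sketch below. It assumes the same id -> offset table lut as in the question, passed in as helper data; subvector_sorted is a hypothetical name. Sorting the offsets rather than the X values themselves is an illustrative choice that works because values is already sorted, so ascending offset means ascending value.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct X {
    float value;
    int id;
};

// Assumption: lut[id] is the offset of that id in `values` (built once up front).
void subvector_sorted(std::vector<X> const& values,
                      std::vector<int> const& ids,
                      std::vector<X>& out,
                      std::vector<std::size_t> const& lut) {
    // Translate the requested ids into offsets into `values` (linear in M).
    std::vector<std::size_t> offsets;
    offsets.reserve(ids.size());
    for (int id : ids) {
        offsets.push_back(lut[static_cast<std::size_t>(id)]);
    }

    // O(M log M): sorting the offsets yields the elements in value order.
    std::sort(offsets.begin(), offsets.end());

    // Copy the selected elements in sorted order (linear in M).
    out.clear();
    out.reserve(offsets.size());
    for (std::size_t pos : offsets) {
        out.push_back(values[pos]);
    }
}
```

This never touches the N - M unselected elements, which is why it can win when M is much smaller than N and lose when M is close to N, as in the question.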

I hope this makes sense. If you still have questions, I will try to answer them :)

Edit (in response to a comment):
I will explain further what I mean.
Say you know that your numbers will range from 1 to 100.
You have them sorted somewhere (actually they are "naturally" sorted) and you want to get a subset of them in sorted form.
If it were possible to do this faster than O(N) or O(M log M), sorting algorithms would simply use that method to sort.

For example, given the set of numbers {5,10,3,8,9,1,7}, and knowing that they are a subset of the sorted set of numbers {1,2,3,4,5,6,7,8,9,10}, you still can't sort them faster than O(N) (N = 10) or O(M log M) (M = 7).
