如何让这个功能更加高效?

发布于 2025-01-16 23:02:49 字数 488 浏览 1 评论 0 原文

我正在尝试创建一个小型数据库,其中包含人员列表、他们的姓名、ID 以及随着时间推移的收入。我想使用这个数据库找到他们的收入中位数。

我为人们设想的类结构是这样的

class Person {
       string name;
       string ID;
       vector<int> Incomes;
 };

整个数据库的类结构只是包含这些收入的向量。

 class Database{
             vector<Person> somepeople;
 };

假设我创建了一个名为中位收入的函数,它返回该数据库中所有人的中位收入。显而易见的方法是迭代每个人,然后迭代收入向量中的每个值以创建一个包含所有值的列表,然后从中找到中位数。然而,这需要 O(n*k) 时间。有没有办法在 O(nlogn) 或更短的时间内实现这一点,包括 C++ STL 提供的方法?

I'm trying to create a tiny database that contains a list of people, their names, ID, and their income as time passed on. Using this database I'd like to find their median income.

The class structure I have in mind for the people is something like this

class Person {
       string name;
       string ID;
       vector<int> Incomes;
 };

And the class structure for the database overall is just a vector containing these incomes.

 class Database{
             vector<Person> somepeople;
 };

Let's say I create a function called median income that returns the median income of all people in this database. The obvious method would be to iterate through every person, and then subsequently iterate through every value in the vector of incomes to create a list that contains all values and then from that finding the median. However, this would be in O(n*k) time. Is there a way to achieve this in O(nlogn) or less, including ways provided by the C++ STL?

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

鹿港小镇 2025-01-23 23:02:49

你有一个非常有趣的任务!我为您从头开始完全实现了两个解决方案的代码。第一个解决方案要短得多,并且仅使用 标准 C++ 库,第二个解决方案要长几倍且更困难但比标准库解决方案快 3 倍

您可以跳过下面的整个描述文本,只运行底部提供的 C++ 代码。

第一个解决方案实现类 StdMedianFinder 来查找能够执行 3 项操作的中位数:

  1. Add(value, count) - 这允许添加不同用户的多个收入一次,收入用value表示,count表示所有用户有多少次该收入。您可以在任何时间点添加任何用户的收入,例如,如果您的数据库不时更新新用户。

  2. Del(value, count) - 这允许删除用户的收入,例如,如果用户从数据库中删除。 value 和 count 的含义与 Add() 操作中的含义相同。

  3. Median() - 此方法返回所有当前收入的中位数,这正是您在问题中想要的。方法速度非常快,无论有多少收入都会立即返回值,因为主要时间和计算消耗者是Add()和Del()。

正如您在上面看到的,库允许增量地执行所有操作,这意味着在不同的时间点您可以通过添加或删除某些用户或其部分收入来更新数据库。增量为小更新提供了非常快的速度,这意味着如果您有 1000 个收入,例如需要 100 微秒才能将它们添加到结构中,然后添加/删除另外 10 个收入,那么它们只需要初始时间的 1% - 只需1 微秒,因为 10 份收入 / 1000 份收入 = 1%

上面的类仅使用 std::map 而没有其他内容。众所周知, std::map 使用 Red- 实现排序(按键)容器黑色二叉搜索树。向我的类添加/删除一个元素大约需要向 std::map 添加/删除一个元素加上迭代器的一个前进时间(++it;--it;)。用于添加/删除一个元素的 std::map 大约使用 O(Log(N)) 时间,其中 N 是容器中已有的元素数量。因此,在我的类中添加/删除 N 元素的总时间需要 O(N * Log(N)) 时间。

在您的问题中,您指出您想要非常快速的解决方案。我的两种解决方案都提供了 O(N * Log(N)) 时间来添加 N 不同的收入。如果您有 n 个用户,每个用户有 k 收入,正如您在问题中所说,那么我的 N = n * k,这意味着时间是O(n * k * Log(n * k))。但有好消息 - 这是全职,只进行中位数的第一次计算,稍后对中位数的调用会立即给出结果,并且正如所有数据库中经常发生的那样,稍后会不时添加或删除新用户,并且如果您添加/删除具有 k 收入的用户,那么您只需要少量的额外时间 O(k * Log(n * k)) 来更新我的班级并重新计算中位数。添加单个用户时,重要不是对所有用户(整个数据库)进行整体重新计算,而只是用该单个用户的收入更新类。

对于更大、速度快 3 倍 倍的第二个解决方案,我从头开始实现了所谓的 B-Tree 不存在于标准 C++ 库中,并且使用此树我实现了 BTreeMedianFinder 类,该类与第一类 StdMedianFinder 具有完全相同的接口代码>.

重要说明 我下面的所有代码(尤其是 B-Tree)都没有经过很好的测试,并且不适合复制粘贴到生产产品中。您必须将其复制粘贴到教育项目中,或者在生产中使用之前进行一些额外的深入测试。

B-Tree的实现方式是,叶子节点中除了存储(Key,Value)(即收入和相同收入的数量)外,中间节点还存储子树的最大值和收入总数一棵子树。这允许非常快速地执行两个操作 - 给定收入值,它快速回答收入在所有收入的排序列表中的位置是什么,并且给定收入在排序列表中的所需位置,它快速回答收入的值是多少该位置(第二个操作用于查找中位数,因为中位数基本上是该位置的收入值,等于所有元素数量的一半)。

以下代码有几个相当大的测试,检查三个类 BTreeStdMedianFinderBTreeMedianFinder 的正确性和速度。程序在控制台中显示所有测试和速度测量的计时。

如果您的计算机中安装了 Boost.StackTrace 库系统它可用于显示堆栈跟踪以使异常消息提供更多信息。默认情况下,在我的代码中,通过定义 #define USE_BOOST_STACKTRACE 0 禁用此功能,您可以将其更改为 1 来启用。此外,此功能的工作仅在 Windows 上进行了全面测试,对于 Linux 系统,可能需要在包含或定义中进行一些小的修复。程序二进制文件应包含符号信息进行编译,为此在 MSVC/Z7 a> 编译器和 GCC/CLang。只有这个第 3 方库可以用作选项,除此之外,我的代码中只使用了标准 C++ 库。

如果您进行调试构建,请不要忘记定义 _DEBUG 宏,如果定义了该宏,那么我会在代码的所有类中进行很多额外的断言检查,以对代码正确性进行额外的测试并防止错误和崩溃。

下面的代码是完全可编译的,无需任何修改即可运行。如果您运行它,它将执行所有测试和速度测量并在控制台中显示它们。另请单击下面的在线尝试!链接,查看此代码在 GodBolt Linux 服务器上的实时运行情况。

在线尝试!

源代码在这里。由于 StackOveflow 对帖子大小的限制为 30000 个字符,因此在外部 PasteBin 服务中提供共享源代码。我的带有代码的帖子似乎有 41000 个字符长!您也可以点击上面的在线尝试!链接,它在GodBolt服务器上复制了相同的源代码。

输出:

'Test BTree' time 1.188 sec
'Test StdMedianFinder' time 0.532 sec
'Test BTreeMedianFinder' time 0.535 sec
'Std_Finder Add' time 1.426 sec
'BTree_Finder Add' time 0.577 sec
'Std_Tree Add' time 1.249 sec
'Std_Finder Median' time 0.001 sec
'BTree_Finder Median' time 0.321 sec
'Std_Finder Del' time 1.252 sec
'BTree_Finder Del' time 0.477 sec
'Std_Tree Del' time 1.117 sec
'Test MedianFinders Speed Total' time 6.907 sec

You have a very interesting task! I fully implemented from scratch code of two solutions for you. First solution is much shorter and uses only standard C++ library, second solution is several times longer and more difficult but 3x faster than standard library solution.

You may skip whole description text below and just run C++ code provided at the bottom.

First solution implements class StdMedianFinder for finding median that is capable of doing 3 operations:

  1. Add(value, count) - this allows to add several incomes of different users at a time, income is signified by value and count means how many occurances of this income all users have. You may add income of any user at any point in time, for example if your database is updated from time to time with new users.

  2. Del(value, count) - this allows to delete incomes of a user, for example in case if user is deleted from database. value and count mean same as in Add() operation.

  3. Median() - this method returns median of all current incomes, this is exactly what you wanted in your Question. Method is very fast, returns value immediately no matter how many incomes are there, because main time and computation consumer is Add() and Del().

As you can see above library allows to do all things incrementally, meaning that in different points in time you can update your database by adding or removing some users or parts of their incomes. Incrementality gives very fast speed for small updates, meaning that if you had 1000 incomes that took for example 100 micro-seconds to add them to structure, and later added/removed 10 more incomes then they will take just 1% of initial time - just 1 micro-second, because 10 incomes / 1000 incomes = 1%.

Class above uses only std::map and nothing else. As it is known std::map implements sorted (by keys) container using Red-Black binary search tree. Adding/removing one element to my class takes approximately time of adding/removing one element to std::map plus one advance of iterator (++it; or --it;). std::map for adding/removing one element uses approximately O(Log(N)) time, where N is amount of elements already in container. Hence total time for adding/removing N elements in my class takes O(N * Log(N)) time.

In your Question you noted that you want very fast solution. Both of my solutions give O(N * Log(N)) time for adding N different incomes. If you have n users and k incomes for each user, as you said in Question, then my N = n * k, it means time is O(n * k * Log(n * k)). But there are good news - this is full time when doing only very first computation of Median, later calls to median give immediate result, and as it often happens in all DataBases new users are added or removed later from time to time, and if you add/remove a user that has k incomes then you need only small extra time O(k * Log(n * k)) to update my class and re-compute Median. When adding single user, Important is not to do whole re-computation of all users (whole database), but only to Update class with incomes of this single user.

For second solution that is much bigger, and 3x times faster, I implemented from scratch so-called B-Tree that is not present in standard C++ library, and using this tree I implmented BTreeMedianFinder class that has exactly same interface of first class StdMedianFinder.

Important Note All my code below (and especially B-Tree) is not very well tested, and can't be suited for copy-pasting into Production products. Either you have to copy-paste it into just educational projects or do some extra deep testing before usage in Production.

B-Tree is implemented in such a way that besides (Key, Value) (which are income and count of same incomes) stored in leaf nodes, it also stores inside intermediate nodes maximum value of a sub-tree and total count of incomes in a sub-tree. This allows to do very fast two operation - given a value of income it answers fast what is the position of income inside a sorted list of all incomes, and given desired position of income inside sorted list it answers fast what is the value of income at that position (this second operation is used to find Median, because median is basically a value of income at position equal to half of number of all elements).

Following code has several quite large tests that check correctness and speed of three classes BTree, StdMedianFinder, BTreeMedianFinder. Program shows in console timings of all tests and speed measurement.

If you have Boost.StackTrace library installed in your system it can be used to show stacktrace to make exceptions messages more informative. By default in my code this feature is disabled by define #define USE_BOOST_STACKTRACE 0, which you can change to 1 to enable. Also work of this feature was tested fully only on Windows, for Linux system some small fixups might be needed inside includes or defines. Program binary should be compiled with symbols information included, for this add option /Z7 in MSVC compiler and -g in GCC/CLang. Only this 3rd-party library might be used as an option, besides it only standard C++ library was used in my code.

If you do debug build, don't forget to define _DEBUG macro, if this macro is defined than I do very many extra assertion checks in all classes of my code to do extra testing of code correctness and prevent bugs and crashes.

Code below is fully compilable and working without any modifications. If you run it executes all tests and speed measurements and shows them in console. Also click on Try it online! link below to see live run of this code on GodBolt Linux servers.

Try it online!

SOURCE CODE HERE. Due to StackOveflow limit of 30000 characters for post size, providing source code shared at external PasteBin service. My post with code appeared to be 41000 characters long! Also you may click Try it online! link above, it has same source code duplicated at GodBolt servers.

Output:

'Test BTree' time 1.188 sec
'Test StdMedianFinder' time 0.532 sec
'Test BTreeMedianFinder' time 0.535 sec
'Std_Finder Add' time 1.426 sec
'BTree_Finder Add' time 0.577 sec
'Std_Tree Add' time 1.249 sec
'Std_Finder Median' time 0.001 sec
'BTree_Finder Median' time 0.321 sec
'Std_Finder Del' time 1.252 sec
'BTree_Finder Del' time 0.477 sec
'Std_Tree Del' time 1.117 sec
'Test MedianFinders Speed Total' time 6.907 sec
夜深人未静 2025-01-23 23:02:49

好吧,只有当有两个 Incomes 向量时,您才能在 O(logk) 中找到它们合并的中位数。因为,二分搜索启用了该方法。

但除此之外,我认为无法避免线性遍历。

认为即使向量预先排序,我们也不知道它们的值如何相互比较,并且中间元素可以随机位于第 2 个或第 88 个向量中。

Well, only when there are two Incomes vectors, you can find the median of their merger in O(logk). Because, binary search enables that method.

But for more than that, I don't think there is any avoiding a linear traversal.

Think that even if the vectors are sorted beforehand, even then we don't know how their values compare against each other, and middle elements can randomly be in the 2nd or 88th vector.

北方的韩爷 2025-01-23 23:02:49

为什么不使用 std::nth_element ?这是 O(n)
如果元素个数是奇数=>中间的元素是中位数
如果元素个数是偶数=>找到迭代器左中间元素,然后找到第二半的右中间元素 min()
中位数为 *leftMiddle + (*rightMiddle - *leftMiddle) / 2;

Why not use std::nth_element ? This is O(n)
If the number of elements is odd => the middle element is the median
If the number of elements is even => find the iterator left middle element, then right middle element min() of 2nd half
median is *leftMiddle + (*rightMiddle - *leftMiddle) / 2;

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文