我正在尝试创建一个小型数据库,其中包含人员列表、他们的姓名、ID 以及随着时间推移的收入。我想使用这个数据库找到他们的收入中位数。
我为人们设想的类结构是这样的
class Person {
string name;
string ID;
vector<int> Incomes;
};
整个数据库的类结构只是包含这些收入的向量。
class Database{
vector<Person> somepeople;
};
假设我创建了一个名为中位收入的函数,它返回该数据库中所有人的中位收入。显而易见的方法是迭代每个人,然后迭代收入向量中的每个值以创建一个包含所有值的列表,然后从中找到中位数。然而,这需要 O(n*k) 时间。有没有办法在 O(nlogn) 或更短的时间内实现这一点,包括 C++ STL 提供的方法?
I'm trying to create a tiny database that contains a list of people, their names, ID, and their income as time passed on. Using this database I'd like to find their median income.
The class structure I have in mind for the people is something like this
class Person {
string name;
string ID;
vector<int> Incomes;
};
And the class structure for the database overall is just a vector containing these incomes.
class Database{
vector<Person> somepeople;
};
Let's say I create a function called median income that returns the median income of all people in this database. The obvious method would be to iterate through every person, and then subsequently iterate through every value in the vector of incomes to create a list that contains all values and then from that finding the median. However, this would be in O(n*k) time. Is there a way to achieve this in O(nlogn) or less, including ways provided by the C++ STL?
发布评论
评论(3)
你有一个非常有趣的任务!我为您从头开始完全实现了两个解决方案的代码。第一个解决方案要短得多,并且仅使用 标准 C++ 库,第二个解决方案要长几倍且更困难但比标准库解决方案快
3 倍
。您可以跳过下面的整个描述文本,只运行底部提供的 C++ 代码。
第一个解决方案实现类
StdMedianFinder
来查找能够执行 3 项操作的中位数:Add(value, count)
- 这允许添加不同用户的多个收入一次,收入用value
表示,count
表示所有用户有多少次该收入。您可以在任何时间点添加任何用户的收入,例如,如果您的数据库不时更新新用户。Del(value, count)
- 这允许删除用户的收入,例如,如果用户从数据库中删除。 value 和 count 的含义与 Add() 操作中的含义相同。Median()
- 此方法返回所有当前收入的中位数,这正是您在问题中想要的。方法速度非常快,无论有多少收入都会立即返回值,因为主要时间和计算消耗者是Add()和Del()。正如您在上面看到的,库允许增量地执行所有操作,这意味着在不同的时间点您可以通过添加或删除某些用户或其部分收入来更新数据库。增量为小更新提供了非常快的速度,这意味着如果您有 1000 个收入,例如需要 100 微秒才能将它们添加到结构中,然后添加/删除另外 10 个收入,那么它们只需要初始时间的 1% - 只需1 微秒,因为
10 份收入 / 1000 份收入 = 1%
。上面的类仅使用 std::map 而没有其他内容。众所周知, std::map 使用 Red- 实现排序(按键)容器黑色二叉搜索树。向我的类添加/删除一个元素大约需要向 std::map 添加/删除一个元素加上迭代器的一个前进时间(
++it;
或--it;)。用于添加/删除一个元素的 std::map 大约使用
O(Log(N))
时间,其中N
是容器中已有的元素数量。因此,在我的类中添加/删除N
元素的总时间需要O(N * Log(N))
时间。在您的问题中,您指出您想要非常快速的解决方案。我的两种解决方案都提供了
O(N * Log(N))
时间来添加N
不同的收入。如果您有n
个用户,每个用户有k
收入,正如您在问题中所说,那么我的N = n * k
,这意味着时间是O(n * k * Log(n * k))
。但有好消息 - 这是全职,只进行中位数的第一次计算,稍后对中位数的调用会立即给出结果,并且正如所有数据库中经常发生的那样,稍后会不时添加或删除新用户,并且如果您添加/删除具有k
收入的用户,那么您只需要少量的额外时间O(k * Log(n * k))
来更新我的班级并重新计算中位数。添加单个用户时,重要不是对所有用户(整个数据库)进行整体重新计算,而只是用该单个用户的收入更新类。对于更大、速度快
3 倍
倍的第二个解决方案,我从头开始实现了所谓的 B-Tree 不存在于标准 C++ 库中,并且使用此树我实现了BTreeMedianFinder
类,该类与第一类StdMedianFinder
具有完全相同的接口代码>.重要说明 我下面的所有代码(尤其是 B-Tree)都没有经过很好的测试,并且不适合复制粘贴到生产产品中。您必须将其复制粘贴到教育项目中,或者在生产中使用之前进行一些额外的深入测试。
B-Tree的实现方式是,叶子节点中除了存储(Key,Value)(即收入和相同收入的数量)外,中间节点还存储子树的最大值和收入总数一棵子树。这允许非常快速地执行两个操作 - 给定收入值,它快速回答收入在所有收入的排序列表中的位置是什么,并且给定收入在排序列表中的所需位置,它快速回答收入的值是多少该位置(第二个操作用于查找中位数,因为中位数基本上是该位置的收入值,等于所有元素数量的一半)。
以下代码有几个相当大的测试,检查三个类
BTree
、StdMedianFinder
、BTreeMedianFinder
的正确性和速度。程序在控制台中显示所有测试和速度测量的计时。如果您的计算机中安装了 Boost.StackTrace 库系统它可用于显示堆栈跟踪以使异常消息提供更多信息。默认情况下,在我的代码中,通过定义
#define USE_BOOST_STACKTRACE 0
禁用此功能,您可以将其更改为1
来启用。此外,此功能的工作仅在 Windows 上进行了全面测试,对于 Linux 系统,可能需要在包含或定义中进行一些小的修复。程序二进制文件应包含符号信息进行编译,为此在 MSVC/Z7 a> 编译器和 GCC/CLang。只有这个第 3 方库可以用作选项,除此之外,我的代码中只使用了标准 C++ 库。如果您进行调试构建,请不要忘记定义
_DEBUG
宏,如果定义了该宏,那么我会在代码的所有类中进行很多额外的断言检查,以对代码正确性进行额外的测试并防止错误和崩溃。下面的代码是完全可编译的,无需任何修改即可运行。如果您运行它,它将执行所有测试和速度测量并在控制台中显示它们。另请单击下面的
在线尝试!
链接,查看此代码在 GodBolt Linux 服务器上的实时运行情况。在线尝试!
源代码在这里。由于 StackOveflow 对帖子大小的限制为 30000 个字符,因此在外部 PasteBin 服务中提供共享源代码。我的带有代码的帖子似乎有 41000 个字符长!您也可以点击上面的
在线尝试!
链接,它在GodBolt服务器上复制了相同的源代码。输出:
You have a very interesting task! I fully implemented from scratch code of two solutions for you. First solution is much shorter and uses only standard C++ library, second solution is several times longer and more difficult but
3x
faster than standard library solution.You may skip whole description text below and just run C++ code provided at the bottom.
First solution implements class
StdMedianFinder
for finding median that is capable of doing 3 operations:Add(value, count)
- this allows to add several incomes of different users at a time, income is signified byvalue
andcount
means how many occurances of this income all users have. You may add income of any user at any point in time, for example if your database is updated from time to time with new users.Del(value, count)
- this allows to delete incomes of a user, for example in case if user is deleted from database. value and count mean same as in Add() operation.Median()
- this method returns median of all current incomes, this is exactly what you wanted in your Question. Method is very fast, returns value immediately no matter how many incomes are there, because main time and computation consumer is Add() and Del().As you can see above library allows to do all things incrementally, meaning that in different points in time you can update your database by adding or removing some users or parts of their incomes. Incrementality gives very fast speed for small updates, meaning that if you had 1000 incomes that took for example 100 micro-seconds to add them to structure, and later added/removed 10 more incomes then they will take just 1% of initial time - just 1 micro-second, because
10 incomes / 1000 incomes = 1%
.Class above uses only std::map and nothing else. As it is known std::map implements sorted (by keys) container using Red-Black binary search tree. Adding/removing one element to my class takes approximately time of adding/removing one element to std::map plus one advance of iterator (
++it;
or--it;
). std::map for adding/removing one element uses approximatelyO(Log(N))
time, whereN
is amount of elements already in container. Hence total time for adding/removingN
elements in my class takesO(N * Log(N))
time.In your Question you noted that you want very fast solution. Both of my solutions give
O(N * Log(N))
time for addingN
different incomes. If you haven
users andk
incomes for each user, as you said in Question, then myN = n * k
, it means time isO(n * k * Log(n * k))
. But there are good news - this is full time when doing only very first computation of Median, later calls to median give immediate result, and as it often happens in all DataBases new users are added or removed later from time to time, and if you add/remove a user that hask
incomes then you need only small extra timeO(k * Log(n * k))
to update my class and re-compute Median. When adding single user, Important is not to do whole re-computation of all users (whole database), but only to Update class with incomes of this single user.For second solution that is much bigger, and
3x
times faster, I implemented from scratch so-called B-Tree that is not present in standard C++ library, and using this tree I implmentedBTreeMedianFinder
class that has exactly same interface of first classStdMedianFinder
.Important Note All my code below (and especially B-Tree) is not very well tested, and can't be suited for copy-pasting into Production products. Either you have to copy-paste it into just educational projects or do some extra deep testing before usage in Production.
B-Tree is implemented in such a way that besides (Key, Value) (which are income and count of same incomes) stored in leaf nodes, it also stores inside intermediate nodes maximum value of a sub-tree and total count of incomes in a sub-tree. This allows to do very fast two operation - given a value of income it answers fast what is the position of income inside a sorted list of all incomes, and given desired position of income inside sorted list it answers fast what is the value of income at that position (this second operation is used to find Median, because median is basically a value of income at position equal to half of number of all elements).
Following code has several quite large tests that check correctness and speed of three classes
BTree
,StdMedianFinder
,BTreeMedianFinder
. Program shows in console timings of all tests and speed measurement.If you have Boost.StackTrace library installed in your system it can be used to show stacktrace to make exceptions messages more informative. By default in my code this feature is disabled by define
#define USE_BOOST_STACKTRACE 0
, which you can change to1
to enable. Also work of this feature was tested fully only on Windows, for Linux system some small fixups might be needed inside includes or defines. Program binary should be compiled with symbols information included, for this add option/Z7
in MSVC compiler and-g
in GCC/CLang. Only this 3rd-party library might be used as an option, besides it only standard C++ library was used in my code.If you do debug build, don't forget to define
_DEBUG
macro, if this macro is defined than I do very many extra assertion checks in all classes of my code to do extra testing of code correctness and prevent bugs and crashes.Code below is fully compilable and working without any modifications. If you run it executes all tests and speed measurements and shows them in console. Also click on
Try it online!
link below to see live run of this code on GodBolt Linux servers.Try it online!
SOURCE CODE HERE. Due to StackOveflow limit of 30000 characters for post size, providing source code shared at external PasteBin service. My post with code appeared to be 41000 characters long! Also you may click
Try it online!
link above, it has same source code duplicated at GodBolt servers.Output:
好吧,只有当有两个
Incomes
向量时,您才能在 O(logk) 中找到它们合并的中位数。因为,二分搜索启用了该方法。但除此之外,我认为无法避免线性遍历。
认为即使向量预先排序,我们也不知道它们的值如何相互比较,并且中间元素可以随机位于第 2 个或第 88 个向量中。
Well, only when there are two
Incomes
vectors, you can find the median of their merger in O(logk). Because, binary search enables that method.But for more than that, I don't think there is any avoiding a linear traversal.
Think that even if the vectors are sorted beforehand, even then we don't know how their values compare against each other, and middle elements can randomly be in the 2nd or 88th vector.
为什么不使用 std::nth_element ?这是 O(n)
如果元素个数是奇数=>中间的元素是中位数
如果元素个数是偶数=>找到迭代器左中间元素,然后找到第二半的右中间元素 min()
中位数为 *leftMiddle + (*rightMiddle - *leftMiddle) / 2;
Why not use std::nth_element ? This is O(n)
If the number of elements is odd => the middle element is the median
If the number of elements is even => find the iterator left middle element, then right middle element min() of 2nd half
median is *leftMiddle + (*rightMiddle - *leftMiddle) / 2;