Computing the median of a billion numbers


If you have one billion numbers and one hundred computers, what is the best way to locate the median of these numbers?

One solution which I have is:

  • Split the set equally among the computers.
  • Sort them.
  • Find the medians for each set.
  • Sort the sets on medians.
  • Merge two sets at a time from the lowest to the highest median.

If we have m1 < m2 < m3 ... then first merge Set1 and Set2 and in the resulting set we can discard all the numbers lower than the median of Set12 (merged). So at any point of time we have equal sized sets. By the way this cannot be done in a parallel manner. Any ideas?

Comments (25)

梦里的微风 2024-09-03 05:38:44

Ah, my brain has just kicked into gear, I have a sensible suggestion now. Probably too late if this had been an interview, but never mind:

Machine 1 shall be called the "control machine", and for the sake of argument either it starts with all the data, and sends it in equal parcels to the other 99 machines, or else the data starts evenly distributed between the machines, and it sends 1/99 of its data to each of the others. The partitions do not have to be equal, just close.

Each other machine sorts its data, and does so in a way which favours finding the lower values first. So for example a quicksort, always sorting the lower part of the partition first[*]. It writes its data back to the control machine in increasing order as soon as it can (using asynchronous IO so as to continue sorting, and probably with Nagle on: experiment a bit).

The control machine performs a 99-way merge on the data as it arrives, but discards the merged data, just keeping count of the number of values it has seen. It calculates the median as the mean of the 1/2 billionth and 1/2 billion plus oneth values.
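
A minimal sketch of that counting merge, with in-memory iterators standing in for the sorted streams arriving over the network (the class and names are illustrative, not code from this answer):

import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Pop values in global sorted order, discard them, and stop as soon as
// the two middle ranks have gone past.
class CountingMerge {
    static double medianOf(List<Iterator<Long>> streams, long n) {
        // heap entries: {next value, index of the stream it came from}
        PriorityQueue<long[]> heap =
                new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));
        for (int i = 0; i < streams.size(); i++) {
            if (streams.get(i).hasNext()) {
                heap.add(new long[]{streams.get(i).next(), i});
            }
        }
        long seen = 0, lower = 0;
        while (!heap.isEmpty()) {
            long[] top = heap.poll();
            seen++;
            if (seen == (n + 1) / 2) lower = top[0];    // lower middle value
            if (seen == n / 2 + 1) {
                return (lower + top[0]) / 2.0;          // mean of the two middles
            }
            Iterator<Long> it = streams.get((int) top[1]);
            if (it.hasNext()) heap.add(new long[]{it.next(), top[1]});
        }
        throw new IllegalArgumentException("n exceeds the stream lengths");
    }
}

For odd n the two ranks coincide and the average degenerates to the single middle value; as described above, the loop never consumes more than the first half of the data.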

This suffers from the "slowest in the herd" problem. The algorithm cannot complete until every value less than the median has been sent by a sorting machine. There's a reasonable chance that one such value will be quite high within its parcel of data. So once the initial partitioning of the data is complete, estimated running time is the combination of the time to sort 1/99th of the data and send it back to the control computer, and the time for the control to read 1/2 the data. The "combination" is somewhere between the maximum and the sum of those times, probably close to the max.

My instinct is that for sending data over a network to be faster than sorting it (let alone just selecting the median) it needs to be a pretty damn fast network. Might be a better prospect if the network can be presumed to be instantaneous, for example if you have 100 cores with equal access to RAM containing the data.

Since network I/O is likely to be the bound, there might be some tricks you can play, at least for the data coming back to the control machine. For example, instead of sending "1,2,3,.. 100", perhaps a sorting machine could send a message meaning "100 values less than 101". The control machine could then perform a modified merge, in which it finds the least of all those top-of-a-range values, then tells all the sorting machines what it was, so that they can (a) tell the control machine how many values to "count" below that value, and (b) resume sending their sorted data from that point.

More generally, there's probably a clever challenge-response guessing game that the control machine can play with the 99 sorting machines.

This involves round-trips between the machines, though, which my simpler first version avoids. I don't really know how to blind-estimate their relative performance, and since the trade-offs are complex, I imagine there are much better solutions out there than anything I'll think of myself, assuming this is ever a real problem.

[*] available stack permitting - your choice of which part to do first is constrained if you don't have O(N) extra space. But if you do have enough extra space, you can take your pick, and if you don't have enough space you can at least use what you do have to cut some corners, by doing the small part first for the first few partitions.

¢蛋碎的人ぎ生 2024-09-03 05:38:44
sort -g numbers | head -n 500000001 | tail -n 2 | dc -e "1 k ? ? + 2 / p"
(Sort numerically, keep the first 500,000,001 lines so that the last two are the 500,000,000th and 500,000,001st values, then let dc average them.)
十六岁半 2024-09-03 05:38:44

I hate to be the contrarian here, but I don't believe sorting is required, and I think any algorithm involving sorting a billion/100 numbers is going to be slow. Let's consider an algorithm on one computer.

1) Select 1000 values at random from the billion, and use them to get an idea of the distribution of the numbers, especially a range.

2) Instead of sorting the values, allocate them to buckets based on the distribution you just calculated. The number of buckets is chosen so that the computer can handle them efficiently, but should otherwise be as large as convenient. The bucket ranges should be so that approximately equal numbers of values go in each bucket (this isn't critical to the algorithm, but it helps efficiency. 100,000 buckets might be appropriate). Note the number of values in each bucket. This is an O(n) process.

3) Find out which bucket range the median lies in. This can be done by simply examining the total numbers in each bucket.

4) Find the actual median by examining the values in that bucket. You can use a sort here if you like, since you are only sorting maybe 10,000 numbers. If the number of values in that bucket is large then you can use this algorithm again until you have a small enough number to sort.

This approach parallelizes trivially by dividing the values between the computers. Each computer reports the totals in each bucket to a 'control' computer which does step 3. For step 4 each computer sends the (sorted) values in the relevant bucket to the control computer (you can do both of those algorithms in parallel too, but it probably isn't worth it).

The total process is O(n), since both steps 3 and 4 are trivial, provided the number of buckets is large enough.
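
A single-machine sketch of steps 2-4 (equal-width buckets and the lower-median convention are simplifications of mine; the sampling step above would instead pick boundaries that fill buckets roughly evenly):

import java.util.Arrays;

// Assumes (hi - lo + 1) times B fits in a long, e.g. values in the
// 32-bit range.
class BucketMedian {
    static final int B = 100_000;                        // number of buckets

    static int bucketOf(long x, long lo, long hi) {
        return (int) ((x - lo) * B / (hi - lo + 1));
    }

    static long median(long[] data, long lo, long hi) {
        long[] counts = new long[B];
        for (long x : data) counts[bucketOf(x, lo, hi)]++;   // step 2: O(n) pass

        long target = (data.length - 1) / 2;      // rank of the lower median
        long before = 0;
        int b = 0;
        while (before + counts[b] <= target) before += counts[b++];   // step 3

        final int chosen = b;                     // bucket holding the median
        long[] small = Arrays.stream(data)
                .filter(x -> bucketOf(x, lo, hi) == chosen).toArray();
        Arrays.sort(small);                       // step 4: sort only ~n/B values
        return small[(int) (target - before)];
    }
}

For the distributed version, each machine computes its own counts array and the control computer sums them before steps 3 and 4, as the answer describes.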

伊面 2024-09-03 05:38:44

The estimation of order statistics like median and 99th percentile can be efficiently distributed with algorithms like t-digest or Q-digest.

Using either algorithm, each node produces a digest, which represents the distribution of the values stored locally. The digests are collected at a single node, merged (effectively summing the distributions), and the median or any other percentile can then be looked up.

This approach is used by elasticsearch and, presumably, BigQuery (going by the description of the QUANTILES function).
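
A sketch of that distribute/summarize/merge/query shape, written against a hypothetical Digest interface (a stand-in, not the actual t-digest or Q-digest API, which differs between libraries):

import java.util.function.Supplier;

// Hypothetical interface - not a real library's API.
interface Digest {
    void add(double x);          // absorb one locally stored value
    void merge(Digest other);    // effectively sum two distributions
    double quantile(double q);   // 0.5 for the median, 0.99 for the 99th percentile
}

class DigestMedian {
    static double medianAcrossNodes(double[][] perNode, Supplier<Digest> factory) {
        Digest merged = factory.get();
        for (double[] node : perNode) {     // conceptually, each node runs its own loop
            Digest local = factory.get();
            for (double x : node) local.add(x);
            merged.merge(local);            // only the small digest crosses the network
        }
        return merged.quantile(0.5);
    }
}

Only digests travel between nodes, so the communication cost is independent of N; the price is that the result is an approximation whose error is governed by the digest's compression parameter.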

世俗缘 2024-09-03 05:38:44

One billion is actually quite a boring task for a modern computer. We're talking about 4 GB worth of 4 byte integers here ... 4 GB ... that's the RAM of some smartphones.

import java.text.DecimalFormat;
import java.util.Arrays;
import java.util.Random;

public class Median {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();

        int[] numbers = new int[1_000_000_000];

        System.out.println("created array after " +  (System.currentTimeMillis() - start) + " ms");

        Random rand = new Random();
        for (int i = 0; i < numbers.length; i++) {
            numbers[i] = rand.nextInt();
        }

        System.out.println("initialized array after " + (System.currentTimeMillis() - start) + " ms");

        Arrays.sort(numbers);

        System.out.println("sorted array after " + (System.currentTimeMillis() - start) + " ms");

        if (numbers.length % 2 == 1) {
            // odd count: the single middle element
            System.out.println("median = " + numbers[numbers.length / 2]);
        } else {
            // even count: mean of the two middle elements (long cast avoids overflow)
            int m1 = numbers[numbers.length / 2 - 1];
            int m2 = numbers[numbers.length / 2];
            double m = ((long) m1 + m2) / 2.0;
            System.out.println("median = " + new DecimalFormat("#.#").format(m));
        }
    }
}

Output on my machine:

created array after 518 ms
initialized array after 10177 ms
sorted array after 102936 ms
median = 19196

So this completes on my machine within less than two minutes (1:43 of which 0:10 are to generate random numbers) using a single core and it's even doing a full sort. Nothing fancy really.

This surely is an interesting task for larger sets of numbers. I just want to make a point here: one billion is peanuts. So think twice before you start throwing complex solutions at surprisingly simple tasks ;)

苏璃陌 2024-09-03 05:38:44

The median for this set of numbers

2, 3, 5, 7, 11, 13, 67, 71, 73, 79, 83, 89, 97

is 67.

The median for this set of numbers

2, 3, 5, 7, 11, 13, 67, 71, 73, 79, 83, 89

is 40.

Assuming the question was about 1,000,000,000 integers(x) where 0 <= x <= 2,147,483,647 and that the OP was looking for (element(499,999,999) + element(500,000,000)) / 2 (if the numbers were sorted). Also assuming that all 100 computers were all equal.

Using my laptop and GigE...

What I found was that my laptop can sort 10,000,000 Int32's in 1.3 seconds. So a rough estimate would be that a billion number sort would take 100 x 1.3 seconds(2 minutes 10 seconds) ;).

An estimate of a one-way file transfer of a 40MB file on a gigabit Ethernet is .32 seconds. This means that the sorted results from all computers will be returned in approximately 32 seconds(computer 99 didn't get his file until 30 seconds after the start). From there it shouldn't take long to discard the lowest 499,999,998 numbers, add the next 2 and divide by 2.

苍景流年 2024-09-03 05:38:44

This might surprise people, but if the numbers are integers small enough to fit inside 32-bit (or smaller) - Just do a bucket sort! Only needs 16GB of ram for any number of 32-bit ints and runs in O(n), which should outperform any distributed systems for reasonable n, e.g. a billion.

Once you have the sorted list, it's trivial to pick out the median. In fact, you do not need to construct the sorted list, but only looking at the buckets should do it.

A simple implementation is shown below. Only works for 16-bit integers, but extension to 32-bit should be easy.

#include <stdio.h>
#include <string.h>

int main()
{
    /* one counter per 16-bit value; unsigned int rather than unsigned short,
       since a single value repeated more than 65535 times would overflow */
    static unsigned int buckets[65536];
    int input, n=0, count=0, i=0;

    // calculate buckets
    memset(buckets, 0, sizeof(buckets));
    while (scanf("%d", &input) != EOF)
    {
        buckets[input & 0xffff]++;
        n++;
    }

    // find median: walk the buckets until the cumulative count passes n/2
    while (count <= n/2)
    {
        count += buckets[i++];
    }

    printf("median: %d\n", i-1);

    return 0;
}

Using a text file with a billion (10^9) numbers and running with time like so

time ./median < billion

yields a running time on my machine of 1m49.293s. Most of the running time is probably disk IO as well.

永不分离 2024-09-03 05:38:44

Oddly enough, I think if you have enough computers, you're better off sorting than using O(n) median-finding algorithms. (Unless your cores are very, very slow, though, I'd just use one and use an O(n) median-finding algorithm for merely 1e9 numbers; if you had 1e12, though, that might be less practical.)

Anyway, let's suppose we have more than log n cores to deal with this problem, and we don't care about power consumption, just getting the answer fast. Let's further assume that this is a SMP machine with all the data already loaded in memory. (Sun's 32-core machines are of this type, for instance.)

One thread chops the list up blindly into equal sized pieces and tells the other M threads to sort them. Those threads diligently do so, in (n/M) log (n/M) time. They then return not only their medians, but, say, their 25th and 75th percentiles as well (perverse worst cases are better if you choose slightly different numbers). Now you have 4M ranges of data. You then sort these ranges and work upwards through the list until you find a number such that, if you throw out every range that is smaller than or contains the number, you will have thrown out half your data. That's your lower bound for the median. Do the same for the upper bound. This takes something like M log M time, and all cores have to wait for it, so it's really wasting M^2 log M potential time. Now you have your single thread tell the others to toss all data outside the range (you should throw out about half on each pass) and repeat--this is a trivially fast operation since the data is already sorted. You shouldn't have to repeat this more than log(n/M) times before it's faster to just grab the remaining data and use a standard O(n) median finder on it.

So, total complexity is something like O((n/M) log (n/M) + M^2 log M log (n/M)). Thus, this is faster than O(n) median sort on one core if M >> log(n/M) and M^3 log M < n, which is true for the scenario you've described.

I think this is a really bad idea given how inefficient it is, but it is faster.

删除→记忆 2024-09-03 05:38:44

This can be done faster than the top-voted O(n log n) algorithm:

- Order-statistics distributed selection algorithm - O(n). Reduce the problem to the original problem of finding the kth number in an unsorted array.
- Counting-sort histogram - O(n). You have to assume some properties about the range of the numbers - can the range fit in memory?
- External merge sort - O(n log n) - described above. You basically sort the numbers on the first pass, then find the median on the second.
- If anything is known about the distribution of the numbers, other algorithms can be produced.

For more details and implementation see:
http://www.fusu.us/2013/07/median-in-large-set-across-1000-servers.html

夜巴黎 2024-09-03 05:38:44

One computer is more than enough to solve the problem.

But let's assume that there are 100 computers. The only complex thing you should do is to sort the list. Split it to 100 parts, send one part to each computer, let them be sorted there, and merge parts after that.

Then take the number from the middle of the sorted list (i.e. with index 500 000 000).

久随 2024-09-03 05:38:44

It depends on your data. The worst case scenario is that it's uniformly distributed numbers.

In this case you can find the median in O(N) time like in this example:

Suppose your numbers are 2,7,5,10,1,6,4,4,6,10,4,7,1,8,4,9,9,3,4,3 (range is 1-10).

We create 3 buckets: 1-3, 4-7, 8-10. Note that top and bottom have equal size.

We fill the buckets with the numbers, count how many fall in each, the max and the min

  • low (5): 2,1,1,3,3, min 1, max 3
  • middle (10): 7,5,6,4,4,6,4,7,4,4, min 4, max 7
  • high (5): 10, 10, 8, 9, 9, min 8, max 10

The median falls in the middle bucket; we disregard the rest

We create 3 buckets: 4, 5-6, 7. Low will start with a count of 5 and with a max of 3 and high with a min of 8 and a count of 5.

For each number we count how many fall in the low and high bucket, the max and the min, and keep the middle bucket.

  • old low (5)
  • low (5): 4, 4, 4, 4, 4, max 4
  • middle (3): 5,6,6
  • high (2): 7, 7, min 7
  • old high (5)

Now we can calculate the median directly: we have a situation like this

old low    low          middle  high  old high
x x x x x  4 4 4 4 4   5 6 6  7 7   x x x x x

so the median is 4.5.

Assuming you know a little about the distribution, you can fine tune how to define the ranges to optimize speed. In any case, the performance should go with O(N), because 1 + 1/3 + 1/9... = 1.5

You need min and max because of edge cases (e.g. if the median is the average between the max of old low and the next element).

All of these operations can be parallelized, you can give 1/100 of the data to each computer and calculate the 3 buckets in each node, then distribute the bucket you keep. This again makes you use the network efficiently because each number is passed on average 1.5 times (so O(N)). You can even beat that if you only pass the minimal numbers among nodes (e.g. if node 1 has 100 numbers and node 2 has 150 numbers, then node 2 can give 25 numbers to node 1).

Unless you know more about the distribution, I doubt you can do better than O(N) here, because you actually need to count the elements at least once.

冷月断魂刀 2024-09-03 05:38:44

An easier method is to have weighted numbers (a sketch of the merge step follows the list).

  • Split the large set among computers
  • Sort each set
  • iterate through the small-set, and calculate weights to repeated elements
  • merge each 2 sets into 1 (each is sorted already) updating weights
  • keep merging sets until you get only one set
  • iterate through this set accumulating weights until you reach OneBillion/2
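
A sketch of the merge step under illustrative types (not code from the answer): each set is a sorted run of (value, weight) pairs, where the weight counts repeats, and the final scan walks the single merged run to the halfway weight.

import java.util.ArrayList;
import java.util.List;

class WeightedMerge {
    record Entry(long value, long weight) {}

    // merge two sorted weighted runs, combining weights of equal values
    static List<Entry> merge(List<Entry> a, List<Entry> b) {
        List<Entry> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            Entry next;
            if (j == b.size() || (i < a.size() && a.get(i).value() <= b.get(j).value())) {
                next = a.get(i++);
            } else {
                next = b.get(j++);
            }
            int last = out.size() - 1;
            if (last >= 0 && out.get(last).value() == next.value()) {
                out.set(last, new Entry(next.value(), out.get(last).weight() + next.weight()));
            } else {
                out.add(next);
            }
        }
        return out;
    }

    // last step: accumulate weights until OneBillion/2 (lower median, for simplicity)
    static long medianOf(List<Entry> run, long total) {
        long seen = 0;
        for (Entry e : run) {
            seen += e.weight();
            if (seen > (total - 1) / 2) return e.value();
        }
        throw new IllegalStateException("total exceeds the run's weight");
    }
}
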
青衫儰鉨ミ守葔 2024-09-03 05:38:44

Split the 10^9 numbers, 10^7 to each computer ~ 80MB on each. Each computer sorts its numbers. Then computer 1 merge-sorts its own numbers with those from computer 2, computer 3 and 4, etc ... Then computer 1 writes half of the numbers back to 2, 3 to 4, etc. Then 1 merge sorts the numbers from computers 1,2,3,4, writes them back. And so on. Depending on the size of RAM on the computers you may get away with not writing all the numbers back to the individual computers at each step, you might be able to accumulate the numbers on computer 1 for several steps, but you do the maths.

Oh, finally get the mean of the 500000000th and 500000001st values (but check there are enough 00s in there, I haven't).

EDIT: @Roman -- well if you can't believe it even it it's true then there's no point in my revealing the truth or falsehood of the proposition. What I meant to state was that brute force sometimes beats smart in a race. It took me about 15 seconds to devise an algorithm which I am confident that I can implement, which will work, and which will be adaptable to a wide range of sizes of inputs and numbers of computers, and tunable to the characteristics of the computers and networking arrangements. If it takes you, or anyone else, say 15 minutes to devise a more sophisticated algorithm I have a 14m45s advantage to code up my solution and start it running.

But I freely admit this is all assertion, I haven't measured anything.

影子的影子 2024-09-03 05:38:44

This could be done on nodes using data that is not sorted across nodes (say from log files) in the following manner.

There is 1 parent node and 99 child nodes. The child nodes have two api calls:

  • stats(): returns min, max and count
  • compare(median_guess): returns count matching value, count less than value and count greater than value

The parent node calls stats() on all child nodes, noting the minimum and maximum of all nodes.

A binary search may now be conducted in the following way:

  1. Bisect the minimum and maximum rounding down - this is the median 'guess'
  2. If the greater than count is more than the less than count, set the minimum to the guess
  3. If the greater than count is less than the less than count, set the maximum to the guess
  4. If count is odd finish when minimum and maximum are equal
  5. If count is even finish when maximum <= minimum + guess.match_count

If stats() and compare() are backed by a pre-calculation - either an O((N/M) log(N/M)) sort per node, or an O(N/M) pass with O(N) memory - then compare() can answer in constant time, so the whole thing (including the pre-calculation) would run in O((N/M) log(N/M)) + O(log N).

Let me know if I have made a mistake!
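
A compact sketch of the parent's loop, assuming integer values and the stats()/compare() API described above; instead of the less-versus-greater comparison it uses the equivalent rank test "how many values are at most the guess", which sidesteps the match_count bookkeeping and returns the lower median:

class MedianBisect {
    interface Child {
        long[] stats();               // {min, max, count}
        long[] compare(long guess);   // {equal, less, greater} counts
    }

    static long lowerMedian(Child[] children) {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, n = 0;
        for (Child c : children) {
            long[] s = c.stats();
            min = Math.min(min, s[0]);
            max = Math.max(max, s[1]);
            n += s[2];
        }
        long target = (n + 1) / 2;                 // rank of the lower median
        while (min < max) {
            long guess = min + (max - min) / 2;    // bisect, rounding down
            long atMost = 0;
            for (Child c : children) {             // one round of calls per step
                long[] r = c.compare(guess);
                atMost += r[0] + r[1];             // equal + less
            }
            if (atMost >= target) max = guess;     // median is <= guess
            else min = guess + 1;                  // median is > guess
        }
        return min;
    }
}

Each step is one broadcast and 99 replies, and the loop runs O(log(max - min)) times over the integer range, in line with the complexity estimate above.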

我不会写诗 2024-09-03 05:38:44

How about this: each node can take 1 billion/100 numbers. At each node the elements can be sorted and the median found. Find the median of those medians. By aggregating the counts of numbers less than the median-of-medians on all nodes, we can find the x%:y% split which the median-of-medians makes. Now ask all nodes to delete elements less than the median of medians (taking a 30%:70% split as an example): 30% of the numbers are deleted, and 70% of 1 billion is 700 million. Now all nodes which deleted fewer than 3 million numbers can send those extras back to a main computer. The main computer redistributes so that all nodes again hold an almost equal count (7 million). The problem is now reduced to 700 million numbers... and so on, until we have a set small enough to be computed on one computer.

滿滿的愛 2024-09-03 05:38:44

Let's first work out how to find a median of n numbers on a single machine:
I am basically using partitioning strategy.

Problem: selection(n, n/2): find the (n/2)-th number counting from the smallest.

You pick say middle element k and partition data into 2 sub arrays. the 1st contains all elements < k and 2nd contains all elements >= k.

If sizeof(1st sub-array) >= n/2, you know that this sub-array contains the median. You can then throw away the 2nd sub-array and solve selection(sizeof 1st sub-array, n/2).

Otherwise, throw away the 1st sub-array and solve selection(2nd sub-array, n/2 - sizeof(1st sub-array)).

Do it recursively.

The expected time complexity is O(n).

Now if we have many machines, in each iteration we have to process an array to split, so we distribute the array across the machines. Each machine processes its chunk of the array and sends a summary back to the hub (controlling) machine: the size of the 1st sub-array and the size of the 2nd sub-array. The hub machine adds up the summaries, decides which sub-array (1st or 2nd) to process further and what the 2nd parameter of selection becomes, and sends that back to each machine. And so on.

Could this algorithm be implemented very neatly using map-reduce?

How does it look?
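
A single-machine sketch of the selection recursion described above (iterative, with a three-way partition so runs of values equal to the pivot are handled); in the distributed version only the two partition sizes travel to the hub, never the data:

import java.util.concurrent.ThreadLocalRandom;

class QuickSelect {

    // k-th smallest element of a, with k 0-based; expected O(n) time
    static int select(int[] a, int k) {
        int lo = 0, hi = a.length;
        while (true) {
            int pivot = a[ThreadLocalRandom.current().nextInt(lo, hi)];
            // partition into [lo..lt) < pivot, [lt..gt) == pivot, [gt..hi) > pivot
            int lt = lo, i = lo, gt = hi;
            while (i < gt) {
                if (a[i] < pivot)      swap(a, lt++, i++);
                else if (a[i] > pivot) swap(a, i, --gt);
                else                   i++;
            }
            if (k < lt)       hi = lt;   // answer lies in the 1st sub-array
            else if (k >= gt) lo = gt;   // answer lies in the 2nd sub-array
            else return pivot;           // k lands on the pivot run
        }
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

For the median, call select(a, (a.length - 1) / 2); the expected O(n) bound comes from the surviving side shrinking geometrically on average.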

沉鱼一梦 2024-09-03 05:38:44

I think Steve Jessop's answer will be the fastest.

If the network data transfer size is the bottleneck, here is another approach.

Divide the numbers into 100 computers (10 MB each). 
Loop until we have one element in each list     
    Find the median in each of them with quickselect, which is O(N), and we are processing in parallel. The lists will be partitioned at the end wrt the median.
    Send the medians to a central computer and find the median of medians. Then send the median back to each computer. 
    For each computer, if the overall median that we just computed is smaller than its median, continue in the lower part of the list (it is already partitioned), and if larger in the upper part.
When we have one number in each list, send them to the central computer and find and return the median.
捂风挽笑 2024-09-03 05:38:44

I would do it like this:

In the beginning all 100 computers work to find the highest and the lowest number; each computer has its part of the database/file, which it queries;

when the highest and lowest numbers are found, one computer reads the data, and distributes each number, evenly, to the rest of the 99; the numbers are distributed by equal intervals; (one may take from -100 million to 0, another - from 0 to 100 million, etc);

While receiving numbers, each of the 99 computers already sorts them;

Then, it's easy to find the median... See how many numbers each computer has, add them all up (the count of how many numbers there are, not the numbers themselves), and divide by 2; then calculate which computer holds that number, and at which index;

:) voilà

P.S. Seems there's a lot of confusion here; the MEDIAN - is the NUMBER IN THE MIDDLE OF A SORTED LIST OF NUMBERS!

弄潮 2024-09-03 05:38:44

You can use the tournament tree method for finding the median.
We can create a tree with 1000 leaf nodes such that each leaf node is an array.
We then conduct n/2 tournaments between the different arrays. The value at the root after the n/2 tournaments is the result.

http://www.geeksforgeeks.org/tournament-tree-and-binary-heap/

心的憧憬 2024-09-03 05:38:44

If the numbers are not distinct, and only belong to a certain range, that is they are repeated, then a simple solution that comes to my mind is to distribute the numbers among 99 machines equally, and keep one machine as the master. Now every machine iterates over its given numbers, and stores the count of each number in a hash set. Each time the number gets repeated in the set of numbers allotted to that particular computer, it updates its count in the hash set.

All the machines then return their hash set to the master machine. The master machine combines the hash sets, summing the counts of the same key found across them. For example machine#1's hash set had an entry of ("1",7), and machine#2's hash set had an entry of ("1",9), so the master machine when combining the hash sets makes an entry of ("1", 16), and so on.

Once the hash sets have been merged, just sort the keys; now you can easily find the (n/2)-th item and the ((n+2)/2)-th item from the sorted hash set.

This method won't be beneficial if the billion numbers are distinct.
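
A sketch of the master's side (the types are illustrative): merge the per-machine frequency maps, then scan the keys in sorted order until the cumulative count reaches the target rank.

import java.util.Map;
import java.util.TreeMap;

class FrequencyMedian {

    // returns the lower median; for even n you would also note the next key
    static long lowerMedian(Iterable<Map<Long, Long>> perMachineCounts) {
        TreeMap<Long, Long> combined = new TreeMap<>();
        long n = 0;
        for (Map<Long, Long> counts : perMachineCounts) {
            for (Map.Entry<Long, Long> e : counts.entrySet()) {
                combined.merge(e.getKey(), e.getValue(), Long::sum);  // ("1",7)+("1",9) -> ("1",16)
                n += e.getValue();
            }
        }
        long target = (n + 1) / 2, seen = 0;
        for (Map.Entry<Long, Long> e : combined.entrySet()) {
            seen += e.getValue();
            if (seen >= target) return e.getKey();
        }
        throw new IllegalStateException("empty input");
    }
}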

失而复得 2024-09-03 05:38:44

Well, suppose you know that the number of distinct integers is (say) 4 billion, then you can bucket them into 64k buckets and get a distributed count for each bucket from each machine in the cluster (100 computers). Combine all these counts. Now find the bucket which holds the median, and this time only ask for counts of the 64k values that would lie in your target bucket. This requires O(1) (specifically 2) rounds of queries over your "cluster". :D

逆光飞翔i 2024-09-03 05:38:44

My penny's worth, after all that has already been brought up by others:

Finding the median on a single machine is O(N): https://en.wikipedia.org/wiki/Selection_algorithm.

Sending N numbers to 100 machines is also O(N). So, in order to make using 100 machines interesting, either the communication must be relatively fast, or N is so large that a single machine cannot handle it while N/100 is doable, or we just want to consider the mathematical problem without bothering about data communication.

To cut things short I'll assume therefore that, within reasonable limits, we can send/distribute the numbers without affecting the efficiency analysis.

Consider then the following approach, where one machine is assigned to be the "master" for some general processing. This will be comparatively fast, so the "master" also participates in the common tasks that each machine performs.

  1. Each machine receives N/100 of the numbers, computes its own median and sends that information to the master.
  2. The master compiles a sorted list of all distinct medians and sends that back to each machine, defining an ordered sequence of buckets (on each machine the same), one for each median value (a single-value bucket) and one for each interval between adjacent medians. Of course there are also the lower-end and higher-end buckets for values below the lowest median and above the highest.
  3. Each machine computes how many numbers fall in each bucket and communicates that information back to the master.
  4. The master determines which bucket contains the median, how many lower values (in total) fall below that bucket, and how many above.
  5. If the selected bucket is a single-value bucket (one of the medians), or else the selected bucket contains only 1 (N odd) or 2 (N even) values, we're done. Otherwise we repeat the steps above with the following (obvious) modifications:
  6. Only the numbers from the selected bucket are (re)distributed from the master to the 100 machines, and moreover
  7. We're not going to compute (on each machine) the median, but the k-th value, where we take into account how many higher numbers have been discarded from the total, and how many lower numbers. Conceptually each machine has also its share of the discarded low/high numbers and takes that into account when computing the new median in the set that (conceptually) includes (its share of) the discarded numbers.

Time-complexity:

  1. A little thinking will convince you that on each step the total number of values to analyse is reduced by a factor at least two (2 would be a rather sick case; you may expect a significantly better reduction). From this we get:
  2. Assuming that finding the median (or k-th value), which is O(N), takes c*N time where the prefactor c does not vary too wildly with N so that we can take it as a constant for the moment, we'll get our final result in at most 2*c*N/100 time. Using 100 machines gives us, therefore, a speedup factor of 100/2 (at least).
  3. As remarked initially: the time involved in communicating the numbers between the machines may make it more attractive to simply do everything on one machine. However, IF we go for the distributed approach, the total count of numbers to be communicated in all steps together will not exceed 2*N (N for the first time, <=N/2 the second time, <= half of that the third, and so on).
素年丶 2024-09-03 05:38:44

  1. Divide the 1 billion numbers into 100 machines. Each machine will have 10^7 numbers.

  2. For each incoming number to a machine, store the number in a frequency map,
    number -> count. Also store the min number in each machine.

  3. Find the median in each machine: starting from the min number in each machine, sum the counts until the median index is reached. The median in each machine will have approximately 5*10^6 numbers less than it and 5*10^6 greater.

  4. Find the median of all the medians, which will have approximately 50*10^7 numbers less than and greater than it - this is the median of the 1 billion numbers.

Now some optimization of 2nd step: Instead of storing in a frequency map, store the counts in a variable bit array. For example: Lets say starting from min number in a machine, these are frequency counts:

[min number] - 8 count
[min+1 number] - 7 count
[min+2 number] - 5 count

The above can be stored in bit array as:

[min number] - 10000000
[min+1 number] - 1000000
[min+2 number] - 10000

Note that altogether it will cost about 10^7 bits for each machine, since each machine only handles 10^7 numbers. 10^7bits = 1.25*10^6 bytes, which is 1.25MB

So with the above approach each machine will need 1.25MB of space to compute local median. And median of medians can be computed from those 100 local medians, resulting in median of 1 billion numbers.

烟柳画桥 2024-09-03 05:38:44

I suggest a method to calculate the median approximately. :) If these one billion numbers are in random order, I think I can pick 1/100 or 1/10 of the one billion numbers randomly, sort them with the 100 machines, then pick their median. Or let's split the billion numbers into 100 parts, let each machine pick 1/10 of each part randomly, and calculate the median of those. After that we have 100 numbers, and we can calculate the median of those 100 numbers more easily. Just a suggestion - I'm not sure if it's mathematically correct. But I think you can show the result to a not-so-good-at-math manager.
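
A sketch of the sampling idea (the sample size and sampling with replacement are choices of mine): the answer's caveat stands - this only estimates the median, with accuracy improving as the sample grows.

import java.util.Arrays;
import java.util.Random;

class SampledMedian {

    static long approxMedian(long[] data, int sampleSize, Random rnd) {
        long[] sample = new long[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = data[rnd.nextInt(data.length)];   // sample with replacement
        }
        Arrays.sort(sample);                              // small, so cheap to sort
        return sample[sampleSize / 2];                    // median of the sample
    }
}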

楠木可依 2024-09-03 05:38:44

Steve Jessop's answer is wrong:

consider the following four groups:

{2, 4, 6, 8, 10}

{21, 21, 24, 26, 28}

{12, 14, 30, 32, 34}

{16, 18, 36, 38, 40}

The median is 21, which is contained in the second group.

The medians of the four groups are 6, 24, 30, 36; the total median is 27.

So after the first loop, the four groups will become:

{6, 8, 10}

{24, 26, 28}

{12, 14, 30}

{16, 18, 36}

The 21 is already wrongly discarded.

This algorithm only supports the case where there are two groups.
