Find the most common entry in an array

Posted on 2024-07-09 07:44:32

You are given an array of 32-bit unsigned integers with length up to 2^32, with the property that more than half of the entries in the array are equal to N, for some 32-bit unsigned integer N. Find N, looking at each number in the array only once and using at most 2 kB of memory.

Your solution must be deterministic and guaranteed to find N.


Comments (8)

ぇ气 2024-07-16 07:44:33

This is a standard problem in streaming algorithms, where you have a huge (potentially infinite) stream of data and must compute some statistic from it in a single pass.

Clearly you could approach it with hashing or sorting, but with a potentially infinite stream you would run out of memory. So you have to do something smarter here.

The majority element is the element that occupies more than half of the array's positions. This means that the majority element occurs more often than all other elements combined: if you count the number of times the majority element appears and subtract the number of all other elements, you get a positive number.

So if you count the occurrences of some element, subtract the number of all other elements, and get 0, then that element cannot be a majority element of the part of the stream counted so far. This is the basis for a correct algorithm:

Keep two variables, a counter and a possible element. Iterate over the stream: if the counter is 0, overwrite the possible element with the current number and set the counter to 1; otherwise, if the number equals the possible element, increase the counter; otherwise decrease it. Python code:

def majority_element(arr):
    counter, possible_element = 0, None
    for i in arr:
        if counter == 0:
            possible_element, counter = i, 1
        elif i == possible_element:
            counter += 1
        else:
            counter -= 1

    return possible_element

The algorithm clearly runs in O(n) time with a very small constant factor (around 3). It also looks like the space complexity is O(1), because only three variables are used. The catch is that one of them is a counter that can grow up to n (when the array consists of identical numbers), and storing the number n takes O(log(n)) bits. So from a theoretical point of view it is O(n) time and O(log(n)) space. In practice, a 64-bit integer counts up to 2^64 - 1, far more than the 2^32 entries allowed here, so the state fits easily within the 2 kB limit.

Also note that the algorithm works only if there is a majority element. If no such element exists, it still returns some number, which is necessarily wrong. (It is easy to modify the algorithm to tell whether a majority element exists.)

Historical note: this algorithm was devised in the early 1980s by Boyer and Moore and is called the Boyer–Moore majority vote algorithm.

幸福还没到 2024-07-16 07:44:33

I have recollections of this algorithm, which might or might not fit the 2 kB rule. It might need to be rewritten with an explicit stack or the like to avoid breaking the memory limit through function calls, but that may be unnecessary, since the recursion is only ever logarithmically deep. Anyhow, I have vague recollections from college of a recursive solution to this that involved divide and conquer, the secret being that when you divide the group in half, at least one of the halves still has more than half of its values equal to the top value. The basic rule when dividing is that you return two candidate top values, one of which is the top value and one of which is some other value (that may or may not be in 2nd place). I forget the algorithm itself.
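
Since the exact algorithm is not recalled above, here is a minimal sketch of one common divide-and-conquer formulation, not necessarily the one the author has in mind. The function name `majority_divide_and_conquer` and its `lo`/`hi` parameters are made up for illustration, and this version returns a single candidate per half rather than two. Note that it re-reads elements when counting, so it does not satisfy the "look at each number only once" requirement of the question.

```python
def majority_divide_and_conquer(arr, lo=0, hi=None):
    """Return the majority element of arr[lo:hi], or None if there is none."""
    if hi is None:
        hi = len(arr)
    if hi - lo == 0:
        return None
    if hi - lo == 1:
        return arr[lo]
    mid = (lo + hi) // 2
    left = majority_divide_and_conquer(arr, lo, mid)
    right = majority_divide_and_conquer(arr, mid, hi)
    # A majority of the whole range must also be a majority of at least
    # one half, so only the two half-winners need to be counted.
    for candidate in (left, right):
        if candidate is not None:
            count = sum(1 for i in range(lo, hi) if arr[i] == candidate)
            if count > (hi - lo) // 2:
                return candidate
    return None
```

The recursion depth is logarithmic in the array length, which matches the remark above about the number of nested calls, but the counting step makes the total work O(n log n).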

百善笑为先 2024-07-16 07:44:33

Proof of correctness for buti-oxa / Jason Hernandez's answer, assuming Jason's answer is the same as buti-oxa's answer and both work the way the described algorithm should work:

We define the adjusted suspicion strength as being equal to the suspicion strength if the top value is the current suspect, or to -(suspicion strength) if it is not. Every time you read the top value, the adjusted suspicion strength increases by 1. Every time you read a wrong number, it either drops by 1 or increases by 1: it increases only when a different wrong number is the current suspect, and drops otherwise. So the minimum possible ending adjusted suspicion strength is equal to number-of[top values] - number-of[other values]. Because the top value fills more than half of the array, this difference is positive, so the adjusted suspicion strength ends positive, which is only possible if the top value is the final suspect.
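
The invariant in this proof can be spot-checked by machine. The following is a small sketch, assuming the variant of the vote in which a new suspect is adopted whenever the strength is 0; the names `check_adjusted_strength` and `balance` are made up for illustration. It asserts after every element that the adjusted strength is at least (top values seen) - (other values seen).

```python
import random


def check_adjusted_strength(arr, top):
    """Run the vote and assert the invariant from the proof at every step."""
    suspect, strength = None, 0
    balance = 0  # (number of top values seen) - (number of other values seen)
    for number in arr:
        if strength == 0:
            suspect, strength = number, 1
        elif number == suspect:
            strength += 1
        else:
            strength -= 1
        balance += 1 if number == top else -1
        adjusted = strength if suspect == top else -strength
        assert adjusted >= balance
    return suspect


random.seed(0)  # fixed seed so the run is repeatable
for _ in range(200):
    n = random.randint(1, 30)
    majority_count = n // 2 + 1  # strictly more than half of n
    others = [random.randint(0, 5) for _ in range(n - majority_count)]
    arr = [7] * majority_count + others
    random.shuffle(arr)
    assert check_adjusted_strength(arr, top=7) == 7
```

A randomized check like this is not a proof, but it would catch an off-by-one in either the invariant or the implementation.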

夏末 2024-07-16 07:44:32

Keep one counter for each of the 32 bit positions, and for each integer in the array increment the counters of the bits that are set in it.

At the end, some of the bits will have a count higher than half the length of the array - those bits determine N. Of course, a count may be higher than the number of times N occurred, but that doesn't matter. The important thing is that any bit which isn't part of N cannot occur in more than half of the entries (because N has over half the entries), and any bit which is part of N must occur in more than half of the entries (because it occurs every time N occurs, plus any extras).

(No code at the moment - about to lose net access. Hopefully the above is clear enough though.)
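
As a rough illustration of the idea above, here is a minimal single-pass sketch. The function name `majority_by_bit_counts` is made up, and the input is assumed to be an iterable of 32-bit unsigned integers in which a majority value really exists; if it does not, the result is meaningless.

```python
def majority_by_bit_counts(stream):
    """Single pass over an iterable of 32-bit unsigned integers."""
    bit_counts = [0] * 32  # one counter per bit position
    length = 0             # the array length is counted in the same pass
    for number in stream:
        length += 1
        for bit in range(32):
            bit_counts[bit] += (number >> bit) & 1
    result = 0
    for bit in range(32):
        # The bit is part of N exactly when it is set in more than half
        # of the entries.
        if 2 * bit_counts[bit] > length:
            result |= 1 << bit
    return result
```

In a fixed-width implementation the state is 32 counters plus the length; with 64-bit integers that is 33 * 8 = 264 bytes, comfortably under 2 kB. Python integers are larger objects, so the byte count applies to a fixed-width language, not to this sketch.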

画骨成沙 2024-07-16 07:44:32

Boyer and Moore's "Linear Time Majority Vote Algorithm" - go down the array maintaining your current guess at the answer.

漆黑的白昼 2024-07-16 07:44:32

You can do this with only two variables.

public uint MostCommon(UInt32[] numberList)
{
    uint suspect = 0;
    // long rather than int, so the count cannot overflow on very long inputs.
    long suspicionStrength = 0;
    foreach (uint number in numberList)
    {
        if (suspicionStrength == 0)
        {
            // No suspect is left standing: adopt the current number.
            suspect = number;
            suspicionStrength = 1;
        }
        else if (number == suspect)
        {
            suspicionStrength++;
        }
        else
        {
            suspicionStrength--;
        }
    }
    return suspect;
}

Make the first number the suspect, then keep looping through the list. If a number matches the suspect, raise the suspicion strength by one; if it doesn't match, lower it by one. Once the suspicion strength has dropped to 0, the next number becomes the new suspect with a strength of 1. (The strength has to be reset at that point; without the reset, an input such as 1, 1, 2 ends with the wrong suspect.) This will not find the most common number in general, only a number that makes up more than 50% of the group. Resist the urge to add a check for suspicionStrength exceeding half the list length - it will always result in more total comparisons.

P.S. I have not tested this code - use it at your own peril.

杯别 2024-07-16 07:44:32

Pseudo code (notepad C++ :-)) for Jon's algorithm:

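#include <cstddef>
#include <cstdint>

// arrNumbers is assumed to be a real uint32_t array (not a pointer), so sizeof works.
size_t lNumbers = sizeof(arrNumbers) / sizeof(arrNumbers[0]);

// One counter per bit position: 32 * 8 bytes = 256 bytes, well under 2 kB.
uint64_t arrBits[32] = {0};

for (size_t i = 0; i < lNumbers; i++)
  for (int bi = 0; bi < 32; bi++)
    arrBits[bi] += (arrNumbers[i] >> bi) & 1u;

uint32_t N = 0;

// A bit belongs to N exactly when it is set in more than half of the entries.
for (int bc = 0; bc < 32; bc++)
  if (arrBits[bc] > lNumbers / 2)
    N |= (uint32_t{1} << bc);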
善良天后 2024-07-16 07:44:32

Notice that if the sequence a_0, a_1, ..., a_{n-1} contains a leader, then after removing a pair of elements of different values, the remaining sequence still has the same leader. Indeed, if we remove two different elements then only one of them could be the leader. The leader in the new sequence occurs more than n/2 - 1 = (n-2)/2 times. Consequently, it is still the leader of the new sequence of n - 2 elements.

Here is a Python implementation, with O(n) time complexity:

def goldenLeader(A):
    n = len(A)
    # Pass 1: cancel pairs of different values; whatever survives is the
    # only possible leader.
    size = 0
    value = None
    for k in range(n):
        if size == 0:
            size += 1
            value = A[k]
        elif value != A[k]:
            size -= 1
        else:
            size += 1
    candidate = -1
    if size > 0:
        candidate = value
    # Pass 2: confirm that the candidate really occurs more than n/2 times.
    leader = -1
    count = 0
    for k in range(n):
        if A[k] == candidate:
            count += 1
    if count > n // 2:
        leader = candidate
    return leader