Python：在数组中获得最大的出现

发布于 2025-01-27 23:10:57 字数 4429 浏览 2 评论 0原文

我实施了代码，以尝试在 numpy数组中获得最大出现。我使用 numba 感到满意。我想知道是否可以将其改进到一般情况下。

numba实现

import numba as nb
import numpy as np
import collections

@nb.njit("int64(int64[:])")
def max_count_unique_num(x):
    """
    Counts maximum number of unique integer in x.

    Args:
        x (numpy array): Integer array.

    Returns:
        Int

    """
    # get maximum value
    m = x[0]
    for v in x:
        if v > m:
            m = v

    if m == 0:
        return x.size

    # count each unique value
    num = np.zeros(m + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1
    # maximum count
    m = 0
    for k in num:
        if k > m:
            m = k
    return m

用于比较，我还实现了numpy的unique和collections.counter

def np_unique(x):
    """ Counts maximum occurrence using numpy's unique. """
    ux, uc = np.unique(x, return_counts=True)
    return uc.max()


def counter(x):
    """ Counts maximum occurrence using collections.Counter. """
    counts = collections.Counter(x)
    return max(counts.values())

，如@mechanicpig所建议。

In [1]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [2]: %timeit max_count_unique_num(x)
30 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: %timeit np_unique(x)
1.14 ms ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit counter(x)
2.68 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [6]: %timeit counter(x)
3.07 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit np_unique(x)
1.3 ms ± 7.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit max_count_unique_num(x)
490 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [10]: %timeit np.bincount(x).max()
32.3 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [11]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [12]: %timeit np.bincount(x).max()
830 µs ± 6.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

numba实施的局限性很明显：效率仅当x中的所有值都是小的正面int，并且将大大降低非常大的int;不适用于float和负值。

我有什么办法可以概括实现并保持速度？

更新
在检查np.nique的源代码后，一般案例的实现可能是：

@nb.njit(["int64(int64[:])", "int64(float64[:])"])
def max_count_unique_num_2(x):
    x.sort()
    n = 0
    k = 0
    x0 = x[0]
    for v in x:
        if x0 == v:
            k += 1
        else:
            if k > n:
                n = k
            k = 1
            x0 = v
    # for last item in x if it equals to previous one
    if k > n:
        n = k
    return n

timeit

In [154]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [155]: %timeit max_count_unique_num(x)
519 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [156]: %timeit np_unique(x)
1.3 ms ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [157]: %timeit max_count_unique_num_2(x)
240 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [158]: x = np.random.randint(0, 200000, size=300000).astype(np.int64)
In [159]: %timeit max_count_unique_num(x)
1.01 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [160]: %timeit np_unique(x)
18.1 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [161]: %timeit max_count_unique_num_2(x)
3.58 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

so：

如果x中的大整数和大小不大，max_count_unique_num_2 beats max_count_unique_num。
MAX_COUNT_UNICE_NUM和max_count_unique_num_2明显比np.unique要快得多。
max_count_unique_num_2上的小修改可以返回具有最大出现的项目，即使所有具有相同最大发生的项目也是如此。
MAX_COUNT_UNICE_NUM_2如果X本身可以通过删除X.Sort（）进行排序，甚至可以加速加速。

原文

I implemented codes to try to get maximum occurrence in numpy array. I was satisfactory using numba, but got limitations. I wonder whether it can be improved to a general case.

numba implementation

import numba as nb
import numpy as np
import collections

@nb.njit("int64(int64[:])")
def max_count_unique_num(x):
    """
    Counts maximum number of unique integer in x.

    Args:
        x (numpy array): Integer array.

    Returns:
        Int

    """
    # get maximum value
    m = x[0]
    for v in x:
        if v > m:
            m = v

    if m == 0:
        return x.size

    # count each unique value
    num = np.zeros(m + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1
    # maximum count
    m = 0
    for k in num:
        if k > m:
            m = k
    return m

For comparisons, I also implemented numpy's unique and collections.Counter

def np_unique(x):
    """ Counts maximum occurrence using numpy's unique. """
    ux, uc = np.unique(x, return_counts=True)
    return uc.max()


def counter(x):
    """ Counts maximum occurrence using collections.Counter. """
    counts = collections.Counter(x)
    return max(counts.values())

timeit

Edit: Add np.bincount for additional comparison, as suggested by @MechanicPig.

In [1]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [2]: %timeit max_count_unique_num(x)
30 µs ± 387 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [3]: %timeit np_unique(x)
1.14 ms ± 1.65 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [4]: %timeit counter(x)
2.68 ms ± 33.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [5]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [6]: %timeit counter(x)
3.07 ms ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [7]: %timeit np_unique(x)
1.3 ms ± 7.35 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [8]: %timeit max_count_unique_num(x)
490 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [9]: x = np.random.randint(0, 2000, size=30000).astype(np.int64)
In [10]: %timeit np.bincount(x).max()
32.3 µs ± 250 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [11]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [12]: %timeit np.bincount(x).max()
830 µs ± 6.09 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

The limitations of numba implementation are quite obvious: efficiency only when all values in x are small positive int and will be significantly reduced for very large int; not applicable to float and negative values.

Any way I can generalize the implementation and keep the speed?

Update
After checking the source code of np.unique, an implementation for general cases can be:

@nb.njit(["int64(int64[:])", "int64(float64[:])"])
def max_count_unique_num_2(x):
    x.sort()
    n = 0
    k = 0
    x0 = x[0]
    for v in x:
        if x0 == v:
            k += 1
        else:
            if k > n:
                n = k
            k = 1
            x0 = v
    # for last item in x if it equals to previous one
    if k > n:
        n = k
    return n

timeit

In [154]: x = np.random.randint(0, 200000, size=30000).astype(np.int64)
In [155]: %timeit max_count_unique_num(x)
519 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [156]: %timeit np_unique(x)
1.3 ms ± 9.88 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [157]: %timeit max_count_unique_num_2(x)
240 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [158]: x = np.random.randint(0, 200000, size=300000).astype(np.int64)
In [159]: %timeit max_count_unique_num(x)
1.01 ms ± 7.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [160]: %timeit np_unique(x)
18.1 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [161]: %timeit max_count_unique_num_2(x)
3.58 ms ± 28.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

So:

If large integer in x and the size is not large, max_count_unique_num_2 beats max_count_unique_num.
Both max_count_unique_num and max_count_unique_num_2 are significantly faster than np.unique.
Small modification on max_count_unique_num_2 can return the item that has maximum occurrence, even all items having same maximum occurrence.
max_count_unique_num_2 can even be accelerated if x is itself sorted by removing x.sort().

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

神回复 2025-02-03 23:10:57

如果缩短代码：

@nb.njit("int64(int64[:])", fastmath=True)
def shortened(x):
    num = np.zeros(x.max() + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1

    return num.max()

或并行：

@nb.njit("int64(int64[:])", parallel=True, fastmath=True)
def shortened_paralleled(x):
    num = np.zeros(x.max() + 1, dtype=x.dtype)
    for k in nb.prange(x.size):
        num[x[k]] += 1

    return num.max()

并行化将对较大的数据大小进行击败。请注意，并行在某些运行中会获得不同的结果，并且需要治愈。

用于使用numba：

@nb.njit("int8(float64[:])", fastmath=True)
def shortened_float(x):
    num = np.zeros(x.size, dtype=np.int8)
    for k in x:
        for j in range(x.shape[0]):
            if k == x[j]:
                num[j] += 1

    return num.max()

imo，np.unique（x，return_counts = true）[1] .max（）是处理整数和漂浮在A中的最佳选择实施非常快。 NUMBA对于整数来说可能更快（这取决于数据大小，因为较大的数据大小的性能较弱； AIK，是循环本能所致），但对于浮子，必须根据性能优化代码。但是我认为Numba可以击败numpy unique， ，尤其是当我们面对大数据 时。

注意：np.bincount可以处理只是整数。

What if shortening your code:

@nb.njit("int64(int64[:])", fastmath=True)
def shortened(x):
    num = np.zeros(x.max() + 1, dtype=x.dtype)
    for k in x:
        num[k] += 1

    return num.max()

or paralleled:

@nb.njit("int64(int64[:])", parallel=True, fastmath=True)
def shortened_paralleled(x):
    num = np.zeros(x.max() + 1, dtype=x.dtype)
    for k in nb.prange(x.size):
        num[x[k]] += 1

    return num.max()

Parallelizing will beat for larger data sizes. Note that parallel will get different result in some runs and need to be cured if be possible.

For handling the floats (or negative values) using Numba:

@nb.njit("int8(float64[:])", fastmath=True)
def shortened_float(x):
    num = np.zeros(x.size, dtype=np.int8)
    for k in x:
        for j in range(x.shape[0]):
            if k == x[j]:
                num[j] += 1

    return num.max()

IMO, np.unique(x, return_counts=True)[1].max() is the best choice which handle both integers and floats in a very fast implementation. Numba can be faster for integers (it depends on the data sizes as larger data sizes weaker performance; AIK, it is due to looping instinct than arrays), but for floats the code must be optimized in terms of performance if it could; But I don't think that Numba can beat NumPy unique, particularly when we faced to large data.

Notes: np.bincount can handle just integers.

回复收藏 0 原文

梦魇绽荼蘼 2025-02-03 23:10:57

您也可以在不使用numpy的情况下执行此操作。

arr = [1,1,2,2,3,3,4,5,6,1,3,5,7,1]
counts = list(map(list(arr).count, set(arr)))
list(set(arr))[counts.index(max(counts))]

如果要使用numpy，请尝试使用此操作，

arr = np.array([1,1,2,2,3,3,4,5,6,1,3,5,7,1])
uniques, counts = np.unique(arr, return_counts = True)
uniques[np.where(counts == counts.max())]

都可以做完全相同的工作。要检查哪种方法更有效，只需执行此操作，

time_i = time.time()
<arr declaration> # Creating a new array each iteration can cause the total time to increase which would be biased against the numpy method. 
for i in range(10**5):
  <method you want>
time_f = time.time()

当我运行此方法时，我的第一种方法获得了0.39秒，第二种方法为2.69。因此，可以肯定地说，第一种方法更有效。

You can do that without using numpy too.

arr = [1,1,2,2,3,3,4,5,6,1,3,5,7,1]
counts = list(map(list(arr).count, set(arr)))
list(set(arr))[counts.index(max(counts))]

If you want to use numpy then try this,

arr = np.array([1,1,2,2,3,3,4,5,6,1,3,5,7,1])
uniques, counts = np.unique(arr, return_counts = True)
uniques[np.where(counts == counts.max())]

Both do the exact same job. To check which method is more efficient just do this,

time_i = time.time()
<arr declaration> # Creating a new array each iteration can cause the total time to increase which would be biased against the numpy method. 
for i in range(10**5):
  <method you want>
time_f = time.time()

When I ran this I got 0.39 seconds for the first method and 2.69 for the second one. So it's pretty safe to say that the first method is more efficient.

回复收藏 0 原文

薄荷梦 2025-02-03 23:10:57

我要说的是，您的实现几乎与numpy.bincount相同。如果要使它通用，则可以考虑编码原始数据：

def encode(ar):
    # Equivalent to numpy.unique(ar, return_inverse=True)[1] when ar.ndim == 1
    flatten = ar.ravel()
    perm = flatten.argsort()
    sort = flatten[perm]

    mask = np.concatenate(([False], sort[1:] != sort[:-1]))
    encoded = np.empty(sort.shape, np.int64)
    encoded[perm] = mask.cumsum()
    encoded.shape = ar.shape

    return encoded

def count_max(ar):
    return max_count_unique_num(encode(ar))

What I want to say is that your implementation is almost the same as numpy.bincount. If you want to make it universal, you can consider encoding the original data:

def encode(ar):
    # Equivalent to numpy.unique(ar, return_inverse=True)[1] when ar.ndim == 1
    flatten = ar.ravel()
    perm = flatten.argsort()
    sort = flatten[perm]

    mask = np.concatenate(([False], sort[1:] != sort[:-1]))
    encoded = np.empty(sort.shape, np.int64)
    encoded[perm] = mask.cumsum()
    encoded.shape = ar.shape

    return encoded

def count_max(ar):
    return max_count_unique_num(encode(ar))

回复收藏 0 原文

~没有更多了~