C 中的滚动中值算法

发布于 2024-08-02 16:10:40 字数 685 浏览 13 评论 0原文

我目前正在研究一种用 C 语言实现滚动中值滤波器（类似于滚动均值滤波器）的算法。从我对文献的搜索来看，似乎有两种相当有效的方法可以实现。第一种是对初始值窗口进行排序，然后执行二分搜索以插入新值并在每次迭代时删除现有值。

第二个（来自 Hardle 和 Steiger，1995，JRSS-C，算法 296）构建了一个双端堆结构，一端是 maxheap，另一端是 minheap，中间是中位数。这产生了一种线性时间算法，而不是 O(n log n) 的算法。

这是我的问题：实现前者是可行的，但我需要在数百万个时间序列上运行它，因此效率非常重要。事实证明后者非常难以实施。我在 R 的 stats 包代码的 Trunmed.c 文件中找到了代码，但它相当难以解读。

有谁知道线性时间滚动中值算法的编写良好的 C 实现吗？

编辑：链接到 Trunmed.c 代码 http://google.com/codesearch/p?hl=en&sa=N&cd=1&ct=rc#mYw3h_Lb_e0/R-2.2 .0/src/library/stats/src/Trunmed.c

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

我最亲爱的 2024-08-09 16:10:40

我已经看过 R 的 src/library/stats/src/Trunmed.c 几次了，因为我想在独立的 C++ 类/C 子例程中也有类似的东西。请注意，这实际上是两个实现合二为一，请参阅 src/library/stats/man/runmed.Rd（帮助文件的源代码），其中表示

\details{
  Apart from the end values, the result \code{y = runmed(x, k)} simply has
  \code{y[j] = median(x[(j-k2):(j+k2)])} (k = 2*k2+1), computed very
  efficiently.

  The two algorithms are internally entirely different:
  \describe{
    \item{"Turlach"}{is the Härdle-Steiger
      algorithm (see Ref.) as implemented by Berwin Turlach.
      A tree algorithm is used, ensuring performance \eqn{O(n \log
        k)}{O(n * log(k))} where \code{n <- length(x)} which is
      asymptotically optimal.}
    \item{"Stuetzle"}{is the (older) Stuetzle-Friedman implementation
      which makes use of median \emph{updating} when one observation
      enters and one leaves the smoothing window.  While this performs as
      \eqn{O(n \times k)}{O(n * k)} which is slower asymptotically, it is
      considerably faster for small \eqn{k} or \eqn{n}.}
  }
}

It would be happy to see this re-used in更加独立的时尚。你是自愿的吗？我可以帮助解决一些 R 位问题。

编辑 1：除了上面旧版本 Trunmed.c 的链接之外，这里还有

Srunmed.c（对于 Stuetzle 版本）
Trunmed.c（对于 Turlach 版本）
runmed.R R 函数

对于调用这些编辑 2 的：Ryan Tibshirani 在快速中值分箱，这可能是窗口方法的合适起点。

I have looked at R's src/library/stats/src/Trunmed.c a few times as I wanted something similar too in a standalone C++ class / C subroutine. Note that this are actually two implementations in one, see src/library/stats/man/runmed.Rd (the source of the help file) which says

\details{
  Apart from the end values, the result \code{y = runmed(x, k)} simply has
  \code{y[j] = median(x[(j-k2):(j+k2)])} (k = 2*k2+1), computed very
  efficiently.

  The two algorithms are internally entirely different:
  \describe{
    \item{"Turlach"}{is the Härdle-Steiger
      algorithm (see Ref.) as implemented by Berwin Turlach.
      A tree algorithm is used, ensuring performance \eqn{O(n \log
        k)}{O(n * log(k))} where \code{n <- length(x)} which is
      asymptotically optimal.}
    \item{"Stuetzle"}{is the (older) Stuetzle-Friedman implementation
      which makes use of median \emph{updating} when one observation
      enters and one leaves the smoothing window.  While this performs as
      \eqn{O(n \times k)}{O(n * k)} which is slower asymptotically, it is
      considerably faster for small \eqn{k} or \eqn{n}.}
  }
}

It would be nice to see this re-used in a more standalone fashion. Are you volunteering? I can help with some of the R bits.

Edit 1: Besides the link to the older version of Trunmed.c above, here are current SVN copies of

Srunmed.c (for the Stuetzle version)
Trunmed.c (for the Turlach version)
runmed.R for the R function calling these

Edit 2: Ryan Tibshirani has some C and Fortran code on fast median binning which may be a suitable starting point for a windowed approach.

回复收藏 0 原文

拥醉 2024-08-09 16:10:40

我找不到具有顺序统计的 C++ 数据结构的现代实现，因此最终在 MAK 建议的顶级编码器链接中实现了这两种想法（比赛编辑：向下滚动到 FloatingMedian）。

两个多重集

第一个想法将数据划分为两个数据结构（堆、多重集等），每次插入/删除的时间复杂度为 O(ln N)，不允许在没有很大成本的情况下动态更改分位数。即我们可以有滚动中位数，或滚动 75%，但不能同时两者。

线段树

第二种想法使用线段树，它的插入/删除/查询时间复杂度为 O(ln N)，但更灵活。最重要的是“N”是数据范围的大小。因此，如果您的滚动中位数有一个包含 100 万个项目的窗口，但您的数据从 1..65536 变化，那么 100 万个滚动窗口的每次移动只需要 16 次操作！

C++ 代码类似于 Denis 上面发布的内容（“这是一个用于量化数据的简单算法”）

GNU 阶数统计树

就在放弃之前，我发现 stdlibc++ 包含阶数统计树！

它们有两个关键操作：

iter = tree.find_by_order(value)
order = tree.order_of_key(value)

请参阅libstdc++手册policy_based_data_structs_test（搜索“split and join”）。

我已将树包装在一个方便的标头中，供支持 c++0x/c++11 样式部分 typedef 的编译器使用：

#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree table storing ints and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, then it
// includes its methods by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
                                  T,
                                  __gnu_pbds::null_type,
                                  std::less<T>,
                                  __gnu_pbds::rb_tree_tag,
                                  // This policy updates nodes'  metadata for order statistics.
                                  __gnu_pbds::tree_order_statistics_node_update>;

#endif //GNU_ORDER_STATISTIC_SET_H

I couldn't find a modern implementation of a c++ data structure with order-statistic so ended up implementing both ideas in top coders link suggested by MAK ( Match Editorial: scroll down to FloatingMedian).

Two multisets

The first idea partitions the data into two data structures (heaps, multisets etc) with O(ln N) per insert/delete does not allow the quantile to be changed dynamically without a large cost. I.e. we can have a rolling median, or a rolling 75% but not both at the same time.

Segment tree

The second idea uses a segment tree which is O(ln N) for insert/deletions/queries but is more flexible. Best of all the "N" is the size of your data range. So if your rolling median has a window of a million items, but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window of 1 million!!

The c++ code is similar to what Denis posted above ("Here's a simple algorithm for quantized data")

GNU Order Statistic Trees

Just before giving up, I found that stdlibc++ contains order statistic trees!!!

These have two critical operations:

iter = tree.find_by_order(value)
order = tree.order_of_key(value)

See libstdc++ manual policy_based_data_structures_test (search for "split and join").

I have wrapped the tree for use in a convenience header for compilers supporting c++0x/c++11 style partial typedefs:

#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree table storing ints and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, then it
// includes its methods by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
                                  T,
                                  __gnu_pbds::null_type,
                                  std::less<T>,
                                  __gnu_pbds::rb_tree_tag,
                                  // This policy updates nodes'  metadata for order statistics.
                                  __gnu_pbds::tree_order_statistics_node_update>;

#endif //GNU_ORDER_STATISTIC_SET_H

回复收藏 0 原文

末骤雨初歇 2024-08-09 16:10:40

我已经完成了 C 实现此处。这个问题中有更多细节：C 中的滚动中位数 - Turlach 实现。

使用示例：

int main(int argc, char* argv[])
{
   int i, v;
   Mediator* m = MediatorNew(15);
 
   for (i=0; i<30; i++) {
      v = rand() & 127;
      printf("Inserting %3d \n", v);
      MediatorInsert(m, v);
      v = MediatorMedian(m);
      printf("Median = %3d.\n\n", v);
      ShowTree(m);
   }
}

I've done a C implementation here. A few more details are in this question: Rolling median in C - Turlach implementation.

Sample usage:

int main(int argc, char* argv[])
{
   int i, v;
   Mediator* m = MediatorNew(15);
 
   for (i=0; i<30; i++) {
      v = rand() & 127;
      printf("Inserting %3d \n", v);
      MediatorInsert(m, v);
      v = MediatorMedian(m);
      printf("Median = %3d.\n\n", v);
      ShowTree(m);
   }
}

回复收藏 0 原文

jJeQQOZ5 2024-08-09 16:10:40

我使用这个增量中值估计器：

median += eta * sgn(sample - median)

它与更常见的均值估计器具有相同的形式：

mean += eta * (sample - mean)

这里eta是一个小的学习率参数（例如0.001），并且sgn() 是返回 {-1, 0, 1} 之一的符号函数。（如果数据是非平稳的并且您想要跟踪随时间的变化，请使用这样的常量 eta；否则，对于固定源，请使用类似 eta = 1 / n收敛，其中 n 是迄今为止看到的样本数。）

此外，我修改了中值估计器以使其适用于任意分位数。一般来说，分位数函数告诉您将数据分为两个分数的值：p 和 1 - p。下面逐步估计该值：

quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)

值 p 应在 [0, 1] 范围内。这本质上将 sgn() 函数的对称输出 {-1, 0, 1} 转向一侧，将数据样本划分为两个大小不等的容器（分数）数据的 p 和 1 - p 分别小于/大于分位数估计）。请注意，对于 p = 0.5，这会简化为中值估计量。

I use this incremental median estimator:

median += eta * sgn(sample - median)

which has the same form as the more common mean estimator:

mean += eta * (sample - mean)

Here eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta like this if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources use something like eta = 1 / n to converge, where n is the number of samples seen so far.)

Also, I modified the median estimator to make it work for arbitrary quantiles. In general, a quantile function tells you the value that divides the data into two fractions: p and 1 - p. The following estimates this value incrementally:

quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)

The value p should be within [0, 1]. This essentially shifts the sgn() function's symmetrical output {-1, 0, 1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1 - p of the data are less than/greater than the quantile estimate, respectively). Note that for p = 0.5, this reduces to the median estimator.

回复收藏 0 原文

苦妄 2024-08-09 16:10:40

这是量化数据的简单算法（几个月后）：

""" median1.py: moving median 1d for quantized, e.g. 8-bit data

Method: cache the median, so that wider windows are faster.
    The code is simple -- no heaps, no trees.

Keywords: median filter, moving median, running median, numpy, scipy

See Perreault + Hebert, Median Filtering in Constant Time, 2007,
    http://nomis80.org/ctmf.html: nice 6-page paper and C code,
    mainly for 2d images

Example:
    y = medians( x, window=window, nlevel=nlevel )
    uses:
    med = Median1( nlevel, window, counts=np.bincount( x[0:window] ))
    med.addsub( +, - )  -- see the picture in Perreault
    m = med.median()  -- using cached m, summ

How it works:
    picture nlevel=8, window=3 -- 3 1s in an array of 8 counters:
        counts: . 1 . . 1 . 1 .
        sums:   0 1 1 1 2 2 3 3
                        ^ sums[3] < 2 <= sums[4] <=> median 4
        addsub( 0, 1 )  m, summ stay the same
        addsub( 5, 1 )  slide right
        addsub( 5, 6 )  slide left

Updating `counts` in an `addsub` is trivial, updating `sums` is not.
But we can cache the previous median `m` and the sum to m `summ`.
The less often the median changes, the faster;
so fewer levels or *wider* windows are faster.
(Like any cache, run time varies a lot, depending on the input.)

See also:
    scipy.signal.medfilt -- runtime roughly ~ window size
    http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c

"""

from __future__ import division
import numpy as np  # bincount, pad0

__date__ = "2009-10-27 oct"
__author_email__ = "denis-bz-py at t-online dot de"


#...............................................................................
class Median1:
    """ moving median 1d for quantized, e.g. 8-bit data """

    def __init__( s, nlevel, window, counts ):
        s.nlevel = nlevel  # >= len(counts)
        s.window = window  # == sum(counts)
        s.half = (window // 2) + 1  # odd or even
        s.setcounts( counts )

    def median( s ):
        """ step up or down until sum cnt to m-1 < half <= sum to m """
        if s.summ - s.cnt[s.m] < s.half <= s.summ:
            return s.m
        j, sumj = s.m, s.summ
        if sumj <= s.half:
            while j < s.nlevel - 1:
                j += 1
                sumj += s.cnt[j]
                # print "j sumj:", j, sumj
                if sumj - s.cnt[j] < s.half <= sumj:  break
        else:
            while j > 0:
                sumj -= s.cnt[j]
                j -= 1
                # print "j sumj:", j, sumj
                if sumj - s.cnt[j] < s.half <= sumj:  break
        s.m, s.summ = j, sumj
        return s.m

    def addsub( s, add, sub ):
        s.cnt[add] += 1
        s.cnt[sub] -= 1
        assert s.cnt[sub] >= 0, (add, sub)
        if add <= s.m:
            s.summ += 1
        if sub <= s.m:
            s.summ -= 1

    def setcounts( s, counts ):
        assert len(counts) <= s.nlevel, (len(counts), s.nlevel)
        if len(counts) < s.nlevel:
            counts = pad0__( counts, s.nlevel )  # numpy array / list
        sumcounts = sum(counts)
        assert sumcounts == s.window, (sumcounts, s.window)
        s.cnt = counts
        s.slowmedian()

    def slowmedian( s ):
        j, sumj = -1, 0
        while sumj < s.half:
            j += 1
            sumj += s.cnt[j]
        s.m, s.summ = j, sumj

    def __str__( s ):
        return ("median %d: " % s.m) + \
            "".join([ (" ." if c == 0 else "%2d" % c) for c in s.cnt ])

#...............................................................................
def medianfilter( x, window, nlevel=256 ):
    """ moving medians, y[j] = median( x[j:j+window] )
        -> a shorter list, len(y) = len(x) - window + 1
    """
    assert len(x) >= window, (len(x), window)
    # np.clip( x, 0, nlevel-1, out=x )
        # cf http://scipy.org/Cookbook/Rebinning
    cnt = np.bincount( x[0:window] )
    med = Median1( nlevel=nlevel, window=window, counts=cnt )
    y = (len(x) - window + 1) * [0]
    y[0] = med.median()
    for j in xrange( len(x) - window ):
        med.addsub( x[j+window], x[j] )
        y[j+1] = med.median()
    return y  # list
    # return np.array( y )

def pad0__( x, tolen ):
    """ pad x with 0 s, numpy array or list """
    n = tolen - len(x)
    if n > 0:
        try:
            x = np.r_[ x, np.zeros( n, dtype=x[0].dtype )]
        except NameError:
            x += n * [0]
    return x

#...............................................................................
if __name__ == "__main__":
    Len = 10000
    window = 3
    nlevel = 256
    period = 100

    np.set_printoptions( 2, threshold=100, edgeitems=10 )
    # print medians( np.arange(3), 3 )

    sinwave = (np.sin( 2 * np.pi * np.arange(Len) / period )
        + 1) * (nlevel-1) / 2
    x = np.asarray( sinwave, int )
    print "x:", x
    for window in ( 3, 31, 63, 127, 255 ):
        if window > Len:  continue
        print "medianfilter: Len=%d window=%d nlevel=%d:" % (Len, window, nlevel)
            y = medianfilter( x, window=window, nlevel=nlevel )
        print np.array( y )

# end median1.py

Here's a simple algorithm for quantized data (months later):

""" median1.py: moving median 1d for quantized, e.g. 8-bit data

Method: cache the median, so that wider windows are faster.
    The code is simple -- no heaps, no trees.

Keywords: median filter, moving median, running median, numpy, scipy

See Perreault + Hebert, Median Filtering in Constant Time, 2007,
    http://nomis80.org/ctmf.html: nice 6-page paper and C code,
    mainly for 2d images

Example:
    y = medians( x, window=window, nlevel=nlevel )
    uses:
    med = Median1( nlevel, window, counts=np.bincount( x[0:window] ))
    med.addsub( +, - )  -- see the picture in Perreault
    m = med.median()  -- using cached m, summ

How it works:
    picture nlevel=8, window=3 -- 3 1s in an array of 8 counters:
        counts: . 1 . . 1 . 1 .
        sums:   0 1 1 1 2 2 3 3
                        ^ sums[3] < 2 <= sums[4] <=> median 4
        addsub( 0, 1 )  m, summ stay the same
        addsub( 5, 1 )  slide right
        addsub( 5, 6 )  slide left

Updating `counts` in an `addsub` is trivial, updating `sums` is not.
But we can cache the previous median `m` and the sum to m `summ`.
The less often the median changes, the faster;
so fewer levels or *wider* windows are faster.
(Like any cache, run time varies a lot, depending on the input.)

See also:
    scipy.signal.medfilt -- runtime roughly ~ window size
    http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c

"""

from __future__ import division
import numpy as np  # bincount, pad0

__date__ = "2009-10-27 oct"
__author_email__ = "denis-bz-py at t-online dot de"


#...............................................................................
class Median1:
    """ moving median 1d for quantized, e.g. 8-bit data """

    def __init__( s, nlevel, window, counts ):
        s.nlevel = nlevel  # >= len(counts)
        s.window = window  # == sum(counts)
        s.half = (window // 2) + 1  # odd or even
        s.setcounts( counts )

    def median( s ):
        """ step up or down until sum cnt to m-1 < half <= sum to m """
        if s.summ - s.cnt[s.m] < s.half <= s.summ:
            return s.m
        j, sumj = s.m, s.summ
        if sumj <= s.half:
            while j < s.nlevel - 1:
                j += 1
                sumj += s.cnt[j]
                # print "j sumj:", j, sumj
                if sumj - s.cnt[j] < s.half <= sumj:  break
        else:
            while j > 0:
                sumj -= s.cnt[j]
                j -= 1
                # print "j sumj:", j, sumj
                if sumj - s.cnt[j] < s.half <= sumj:  break
        s.m, s.summ = j, sumj
        return s.m

    def addsub( s, add, sub ):
        s.cnt[add] += 1
        s.cnt[sub] -= 1
        assert s.cnt[sub] >= 0, (add, sub)
        if add <= s.m:
            s.summ += 1
        if sub <= s.m:
            s.summ -= 1

    def setcounts( s, counts ):
        assert len(counts) <= s.nlevel, (len(counts), s.nlevel)
        if len(counts) < s.nlevel:
            counts = pad0__( counts, s.nlevel )  # numpy array / list
        sumcounts = sum(counts)
        assert sumcounts == s.window, (sumcounts, s.window)
        s.cnt = counts
        s.slowmedian()

    def slowmedian( s ):
        j, sumj = -1, 0
        while sumj < s.half:
            j += 1
            sumj += s.cnt[j]
        s.m, s.summ = j, sumj

    def __str__( s ):
        return ("median %d: " % s.m) + \
            "".join([ (" ." if c == 0 else "%2d" % c) for c in s.cnt ])

#...............................................................................
def medianfilter( x, window, nlevel=256 ):
    """ moving medians, y[j] = median( x[j:j+window] )
        -> a shorter list, len(y) = len(x) - window + 1
    """
    assert len(x) >= window, (len(x), window)
    # np.clip( x, 0, nlevel-1, out=x )
        # cf http://scipy.org/Cookbook/Rebinning
    cnt = np.bincount( x[0:window] )
    med = Median1( nlevel=nlevel, window=window, counts=cnt )
    y = (len(x) - window + 1) * [0]
    y[0] = med.median()
    for j in xrange( len(x) - window ):
        med.addsub( x[j+window], x[j] )
        y[j+1] = med.median()
    return y  # list
    # return np.array( y )

def pad0__( x, tolen ):
    """ pad x with 0 s, numpy array or list """
    n = tolen - len(x)
    if n > 0:
        try:
            x = np.r_[ x, np.zeros( n, dtype=x[0].dtype )]
        except NameError:
            x += n * [0]
    return x

#...............................................................................
if __name__ == "__main__":
    Len = 10000
    window = 3
    nlevel = 256
    period = 100

    np.set_printoptions( 2, threshold=100, edgeitems=10 )
    # print medians( np.arange(3), 3 )

    sinwave = (np.sin( 2 * np.pi * np.arange(Len) / period )
        + 1) * (nlevel-1) / 2
    x = np.asarray( sinwave, int )
    print "x:", x
    for window in ( 3, 31, 63, 127, 255 ):
        if window > Len:  continue
        print "medianfilter: Len=%d window=%d nlevel=%d:" % (Len, window, nlevel)
            y = medianfilter( x, window=window, nlevel=nlevel )
        print np.array( y )

# end median1.py

回复收藏 0 原文

屋顶上的小猫咪 2024-08-09 16:10:40

滚动中位数可以通过维护两个数字分区来找到。

为了维护分区，请使用最小堆和最大堆。

最大堆将包含小于等于中位数的数字。

最小堆将包含大于等于中位数的数字。

平衡约束：
如果元素总数是偶数，那么两个堆应该具有相同的元素。

如果元素总数为奇数，则最大堆将比最小堆多一个元素。

中值元素：如果两个分区具有相同数量的元素，则中值将是第一个分区中的最大元素与第二个分区中的最小元素之和的一半。

否则中位数将是第一个分区中的最大元素。

Algorithm-
1- Take two Heap(1 Min Heap and 1 Max Heap)
   Max Heap will contain first half number of elements
   Min Heap will contain second half number of elements

2- Compare new number from stream with top of Max Heap, 
   if it is smaller or equal add that number in max heap. 
   Otherwise add number in Min Heap.

3- if min Heap has more elements than Max Heap 
   then remove top element of Min Heap and add in Max Heap.
   if max Heap has more than one element than in Min Heap 
   then remove top element of Max Heap and add in Min Heap.

4- If Both heaps has equal number of elements then
   median will be half of sum of max element from Max Heap and min element from Min Heap.
   Otherwise median will be max element from the first partition.

public class Solution {

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        RunningMedianHeaps s = new RunningMedianHeaps();
        int n = in.nextInt();
        for(int a_i=0; a_i < n; a_i++){
            printMedian(s,in.nextInt());
        }
        in.close();       
    }

    public static void printMedian(RunningMedianHeaps s, int nextNum){
            s.addNumberInHeap(nextNum);
            System.out.printf("%.1f\n",s.getMedian());
    }
}

class RunningMedianHeaps{
    PriorityQueue<Integer> minHeap = new PriorityQueue<Integer>();
    PriorityQueue<Integer> maxHeap = new PriorityQueue<Integer>(Comparator.reverseOrder());

    public double getMedian() {

        int size = minHeap.size() + maxHeap.size();     
        if(size % 2 == 0)
            return (maxHeap.peek()+minHeap.peek())/2.0;
        return maxHeap.peek()*1.0;
    }

    private void balanceHeaps() {
        if(maxHeap.size() < minHeap.size())
        {
            maxHeap.add(minHeap.poll());
        }   
        else if(maxHeap.size() > 1+minHeap.size())
        {
            minHeap.add(maxHeap.poll());
        }
    }

    public void addNumberInHeap(int num) {
        if(maxHeap.size()==0 || num <= maxHeap.peek())
        {
            maxHeap.add(num);
        }
        else
        {
            minHeap.add(num);
        }
        balanceHeaps();
    }
}

Rolling median can be found by maintaining two partitions of numbers.

For maintaining partitions use Min Heap and Max Heap.

Max Heap will contain numbers smaller than equal to median.

Min Heap will contain numbers greater than equal to median.

Balancing Constraint:
if total number of elements are even then both heap should have equal elements.

if total number of elements are odd then Max Heap will have one more element than Min Heap.

Median Element: If Both partitions has equal number of elements then median will be half of sum of max element from first partition and min element from second partition.

Otherwise median will be max element from first partition.

Algorithm-
1- Take two Heap(1 Min Heap and 1 Max Heap)
   Max Heap will contain first half number of elements
   Min Heap will contain second half number of elements

2- Compare new number from stream with top of Max Heap, 
   if it is smaller or equal add that number in max heap. 
   Otherwise add number in Min Heap.

3- if min Heap has more elements than Max Heap 
   then remove top element of Min Heap and add in Max Heap.
   if max Heap has more than one element than in Min Heap 
   then remove top element of Max Heap and add in Min Heap.

4- If Both heaps has equal number of elements then
   median will be half of sum of max element from Max Heap and min element from Min Heap.
   Otherwise median will be max element from the first partition.

public class Solution {

    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        RunningMedianHeaps s = new RunningMedianHeaps();
        int n = in.nextInt();
        for(int a_i=0; a_i < n; a_i++){
            printMedian(s,in.nextInt());
        }
        in.close();       
    }

    public static void printMedian(RunningMedianHeaps s, int nextNum){
            s.addNumberInHeap(nextNum);
            System.out.printf("%.1f\n",s.getMedian());
    }
}

class RunningMedianHeaps{
    PriorityQueue<Integer> minHeap = new PriorityQueue<Integer>();
    PriorityQueue<Integer> maxHeap = new PriorityQueue<Integer>(Comparator.reverseOrder());

    public double getMedian() {

        int size = minHeap.size() + maxHeap.size();     
        if(size % 2 == 0)
            return (maxHeap.peek()+minHeap.peek())/2.0;
        return maxHeap.peek()*1.0;
    }

    private void balanceHeaps() {
        if(maxHeap.size() < minHeap.size())
        {
            maxHeap.add(minHeap.poll());
        }   
        else if(maxHeap.size() > 1+minHeap.size())
        {
            minHeap.add(maxHeap.poll());
        }
    }

    public void addNumberInHeap(int num) {
        if(maxHeap.size()==0 || num <= maxHeap.peek())
        {
            maxHeap.add(num);
        }
        else
        {
            minHeap.add(num);
        }
        balanceHeaps();
    }
}

回复收藏 0 原文

触ぅ动初心 2024-08-09 16:10:40

也许值得指出的是，有一种特殊情况有一个简单的精确解决方案：当流中的所有值都是（相对）小的定义范围内的整数时。例如，假设它们必须全部位于 0 到 1023 之间。在这种情况下，只需定义一个包含 1024 个元素的数组和一个计数，并清除所有这些值。对于流中的每个值，增加相应的 bin 和计数。流结束后，找到包含 count/2 最高值的 bin - 通过添加从 0 开始的连续 bin 即可轻松完成。使用相同的方法，可以找到任意排名顺序的值。（如果需要在运行期间检测存储箱饱和度并将存储箱的大小“升级”为更大的类型，则会出现一个小复杂情况。）

这种特殊情况可能看起来很人为，但实际上很常见。如果实数位于某个范围内并且已知“足够好”的精度水平，它也可以用作实数的近似值。这几乎适用于对一组“现实世界”对象进行的任何测量。例如，一群人的身高或体重。套装不够大？它对于地球上所有（单个）细菌的长度或重量也同样适用 - 假设有人可以提供数据！

看起来我误读了原文——它似乎需要一个滑动窗口中位数，而不是一个很长的流的中位数。这种方法仍然有效。加载初始窗口的前 N 个流值，然后对于第 N+1 个流值递增相应的 bin，同时递减与第 0 个流值对应的 bin。在这种情况下，有必要保留最后 N 个值以允许递减，这可以通过循环寻址大小为 N 的数组来有效地完成。由于中位数的位置只能改变 -2,-1,0,1 ,2 在滑动窗口的每个步骤中，无需将每个步骤中的所有 bin 求和至中值，只需根据修改了哪一侧 bin 来调整“中值指针”即可。例如，如果新值和被删除的值都低于当前中值，则它不会改变（偏移量 = 0）。当 N 变得太大而无法方便地保存在内存中时，该方法就会失败。

It is maybe worth pointing out that there is a special case which has a simple exact solution: when all the values in the stream are integers within a (relatively) small defined range. For instance, assume they must all lie between 0 and 1023. In this case just define an array of 1024 elements and a count, and clear all of these values. For each value in the stream increment the corresponding bin and the count. After the stream ends find the bin which contains the count/2 highest value - easily accomplished by adding successive bins starting from 0. Using the same method the value of an arbitrary rank order may be found. (There is a minor complication if detecting bin saturation and "upgrading" the size of the storage bins to a larger type during a run will be needed.)

This special case may seem artificial, but in practice it is very common. It can also be applied as an approximation for real numbers if they lie within a range and a "good enough" level of precision is known. This would hold for pretty much any set of measurements on a group of "real world" objects. For instance, the heights or weights of a group of people. Not a big enough set? It would work just as well for the lengths or weights of all the (individual) bacteria on the planet - assuming somebody could supply the data!

It looks like I misread the original - which seems like it wants a sliding window median instead of the just the median of a very long stream. This approach still works for that. Load the first N stream values for the initial window, then for the N+1th stream value increment the corresponding bin while decrementing the bin corresponding to the 0th stream value. It is necessary in this case to retain the last N values to allow the decrement, which can be done efficiently by cyclically addressing an array of size N. Since the position of the median can only change by -2,-1,0,1,2 on each step of the sliding window, it isn't necessary to sum all the bins up to the median on each step, just adjust the "median pointer" depending upon which side(s) bins were modified. For instance, if both the new value and the one being removed fall below the current median then it doesn't change (offset = 0). The method breaks down when N becomes too large to hold conveniently in memory.

回复收藏 0 原文

定格我的天空 2024-08-09 16:10:40

如果您能够将值作为时间点的函数进行参考，则可以应用引导以生成置信区间内的引导中值。与不断地将传入值排序到数据结构中相比，这可以让您更有效地计算近似中位数。

回复收藏 0 原文

饮湿 2024-08-09 16:10:40

对于那些需要 Java 运行中位数的人来说...PriorityQueue 是您的朋友。 O(log N) 插入，O(1) 当前中值，O(N) 删除。如果您知道数据的分布，您可以做得更好。

public class RunningMedian {
  // Two priority queues, one of reversed order.
  PriorityQueue<Integer> lower = new PriorityQueue<Integer>(10,
          new Comparator<Integer>() {
              public int compare(Integer arg0, Integer arg1) {
                  return (arg0 < arg1) ? 1 : arg0 == arg1 ? 0 : -1;
              }
          }), higher = new PriorityQueue<Integer>();

  public void insert(Integer n) {
      if (lower.isEmpty() && higher.isEmpty())
          lower.add(n);
      else {
          if (n <= lower.peek())
              lower.add(n);
          else
              higher.add(n);
          rebalance();
      }
  }

  void rebalance() {
      if (lower.size() < higher.size() - 1)
          lower.add(higher.remove());
      else if (higher.size() < lower.size() - 1)
          higher.add(lower.remove());
  }

  public Integer getMedian() {
      if (lower.isEmpty() && higher.isEmpty())
          return null;
      else if (lower.size() == higher.size())
          return (lower.peek() + higher.peek()) / 2;
      else
          return (lower.size() < higher.size()) ? higher.peek() : lower
                  .peek();
  }

  public void remove(Integer n) {
      if (lower.remove(n) || higher.remove(n))
          rebalance();
  }
}

For those who need a running median in Java...PriorityQueue is your friend. O(log N) insert, O(1) current median, and O(N) remove. If you know the distribution of your data you can do a lot better than this.

public class RunningMedian {
  // Two priority queues, one of reversed order.
  PriorityQueue<Integer> lower = new PriorityQueue<Integer>(10,
          new Comparator<Integer>() {
              public int compare(Integer arg0, Integer arg1) {
                  return (arg0 < arg1) ? 1 : arg0 == arg1 ? 0 : -1;
              }
          }), higher = new PriorityQueue<Integer>();

  public void insert(Integer n) {
      if (lower.isEmpty() && higher.isEmpty())
          lower.add(n);
      else {
          if (n <= lower.peek())
              lower.add(n);
          else
              higher.add(n);
          rebalance();
      }
  }

  void rebalance() {
      if (lower.size() < higher.size() - 1)
          lower.add(higher.remove());
      else if (higher.size() < lower.size() - 1)
          higher.add(lower.remove());
  }

  public Integer getMedian() {
      if (lower.isEmpty() && higher.isEmpty())
          return null;
      else if (lower.size() == higher.size())
          return (lower.peek() + higher.peek()) / 2;
      else
          return (lower.size() < higher.size()) ? higher.peek() : lower
                  .peek();
  }

  public void remove(Integer n) {
      if (lower.remove(n) || higher.remove(n))
          rebalance();
  }
}

回复收藏 0 原文

笨死的猪 2024-08-09 16:10:40

这是当精确输出不重要时可以使用的一个（用于显示目的等）
您需要总计数和最后中位数，加上新值。

{
totalcount++;
newmedian=lastmedian+(newvalue>lastmedian?1:-1)*(lastmedian==0?newvalue: lastmedian/totalcount*2);
}

为 page_display_time 等内容生成非常精确的结果。

规则：输入流需要在页面显示时间的顺序上平滑，数量大（> 30等），并且具有非零中值。

例子：
页面加载时间，800个项目，10ms...3000ms，平均90ms，真实中位数：11ms

经过30次输入，中位数误差一般<=20%（9ms..12ms），并且越来越小。
800 个输入后，误差为 +-2%。

另一位具有类似解决方案的思想家在这里：中值过滤器超高效实现

Here is one that can be used when exact output is not important (for display purposes etc.)
You need totalcount and lastmedian, plus the newvalue.

{
totalcount++;
newmedian=lastmedian+(newvalue>lastmedian?1:-1)*(lastmedian==0?newvalue: lastmedian/totalcount*2);
}

Produces quite exact results for things like page_display_time.

Rules: the input stream needs to be smooth on the order of page display time, big in count (>30 etc), and have a non zero median.

Example:
page load time, 800 items, 10ms...3000ms, average 90ms, real median:11ms

After 30 inputs, medians error is generally <=20% (9ms..12ms), and gets less and less.
After 800 inputs, the error is +-2%.

Another thinker with a similar solution is here: Median Filter Super efficient implementation

回复收藏 0 原文

放赐 2024-08-09 16:10:40

这是java实现

package MedianOfIntegerStream;

import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;


public class MedianOfIntegerStream {

    public Set<Integer> rightMinSet;
    public Set<Integer> leftMaxSet;
    public int numOfElements;

    public MedianOfIntegerStream() {
        rightMinSet = new TreeSet<Integer>();
        leftMaxSet = new TreeSet<Integer>(new DescendingComparator());
        numOfElements = 0;
    }

    public void addNumberToStream(Integer num) {
        leftMaxSet.add(num);

        Iterator<Integer> iterMax = leftMaxSet.iterator();
        Iterator<Integer> iterMin = rightMinSet.iterator();
        int maxEl = iterMax.next();
        int minEl = 0;
        if (iterMin.hasNext()) {
            minEl = iterMin.next();
        }

        if (numOfElements % 2 == 0) {
            if (numOfElements == 0) {
                numOfElements++;
                return;
            } else if (maxEl > minEl) {
                iterMax.remove();

                if (minEl != 0) {
                    iterMin.remove();
                }
                leftMaxSet.add(minEl);
                rightMinSet.add(maxEl);
            }
        } else {

            if (maxEl != 0) {
                iterMax.remove();
            }

            rightMinSet.add(maxEl);
        }
        numOfElements++;
    }

    public Double getMedian() {
        if (numOfElements % 2 != 0)
            return new Double(leftMaxSet.iterator().next());
        else
            return (leftMaxSet.iterator().next() + rightMinSet.iterator().next()) / 2.0;
    }

    private class DescendingComparator implements Comparator<Integer> {
        @Override
        public int compare(Integer o1, Integer o2) {
            return o2 - o1;
        }
    }

    public static void main(String[] args) {
        MedianOfIntegerStream streamMedian = new MedianOfIntegerStream();

        streamMedian.addNumberToStream(1);
        System.out.println(streamMedian.getMedian()); // should be 1

        streamMedian.addNumberToStream(5);
        streamMedian.addNumberToStream(10);
        streamMedian.addNumberToStream(12);
        streamMedian.addNumberToStream(2);
        System.out.println(streamMedian.getMedian()); // should be 5

        streamMedian.addNumberToStream(3);
        streamMedian.addNumberToStream(8);
        streamMedian.addNumberToStream(9);
        System.out.println(streamMedian.getMedian()); // should be 6.5
    }
}

Here is the java implementation

package MedianOfIntegerStream;

import java.util.Comparator;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;


public class MedianOfIntegerStream {

    public Set<Integer> rightMinSet;
    public Set<Integer> leftMaxSet;
    public int numOfElements;

    public MedianOfIntegerStream() {
        rightMinSet = new TreeSet<Integer>();
        leftMaxSet = new TreeSet<Integer>(new DescendingComparator());
        numOfElements = 0;
    }

    public void addNumberToStream(Integer num) {
        leftMaxSet.add(num);

        Iterator<Integer> iterMax = leftMaxSet.iterator();
        Iterator<Integer> iterMin = rightMinSet.iterator();
        int maxEl = iterMax.next();
        int minEl = 0;
        if (iterMin.hasNext()) {
            minEl = iterMin.next();
        }

        if (numOfElements % 2 == 0) {
            if (numOfElements == 0) {
                numOfElements++;
                return;
            } else if (maxEl > minEl) {
                iterMax.remove();

                if (minEl != 0) {
                    iterMin.remove();
                }
                leftMaxSet.add(minEl);
                rightMinSet.add(maxEl);
            }
        } else {

            if (maxEl != 0) {
                iterMax.remove();
            }

            rightMinSet.add(maxEl);
        }
        numOfElements++;
    }

    public Double getMedian() {
        if (numOfElements % 2 != 0)
            return new Double(leftMaxSet.iterator().next());
        else
            return (leftMaxSet.iterator().next() + rightMinSet.iterator().next()) / 2.0;
    }

    private class DescendingComparator implements Comparator<Integer> {
        @Override
        public int compare(Integer o1, Integer o2) {
            return o2 - o1;
        }
    }

    public static void main(String[] args) {
        MedianOfIntegerStream streamMedian = new MedianOfIntegerStream();

        streamMedian.addNumberToStream(1);
        System.out.println(streamMedian.getMedian()); // should be 1

        streamMedian.addNumberToStream(5);
        streamMedian.addNumberToStream(10);
        streamMedian.addNumberToStream(12);
        streamMedian.addNumberToStream(2);
        System.out.println(streamMedian.getMedian()); // should be 5

        streamMedian.addNumberToStream(3);
        streamMedian.addNumberToStream(8);
        streamMedian.addNumberToStream(9);
        System.out.println(streamMedian.getMedian()); // should be 6.5
    }
}

回复收藏 0 原文

舂唻埖巳落 2024-08-09 16:10:40

基于 @mathog 的想法，这是一个 C# 实现，用于在具有已知值范围的字节数组上运行中位数。可以扩展到其他整数类型。

  /// <summary>
  /// Median estimation by histogram, avoids multiple sorting operations for a running median
  /// </summary>
  public class MedianEstimator
  {
    private readonly int m_size2;
    private readonly byte[] m_counts;

    /// <summary>
    /// Estimated median, available right after calling <see cref="Init"/> or <see cref="Update"/>.
    /// </summary>
    public byte Median { get; private set; }

    /// <summary>
    /// Ctor
    /// </summary>
    /// <param name="size">Median size in samples</param>
    /// <param name="maxValue">Maximum expected value in input data</param>
    public MedianEstimator(
      int size,
      byte maxValue)
    {
      m_size2 = size / 2;
      m_counts = new byte[maxValue + 1];
    }

    /// <summary>
    /// Initializes the internal histogram with the passed sample values
    /// </summary>
    /// <param name="values">Array of values, usually the start of the array for a running median</param>
    public void Init(byte[] values)
    {
      for (var i = 0; i < values.Length; i++)
        m_counts[values[i]]++;

      UpdateMedian();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private void UpdateMedian()
    {
      // The median is the first value up to which counts add to size / 2
      var sum = 0;
      Median = 0;
      for (var i = 0; i < m_counts.Length; i++)
      {
        sum += m_counts[i];
        Median = (byte) i;
        if (sum > m_size2) break;
      }
    }

    /// <summary>
    /// Updates the median estimation by removing <paramref name="last"/> and adding <paramref name="next"/>. These
    /// values must be updated as the running median is applied. If the median length is <i>N</i>, at the sample
    /// <i>i</i>, <paramref name="last"/> is sample at index <i>i</i>-<i>N</i>/2 and <paramref name="next"/> is sample
    /// at index <i>i</i>+<i>N</i>/2+1.
    /// </summary>
    /// <param name="last">Sample at the start of the moving window that is to be removed</param>
    /// <param name="next">Sample at the end of the moving window + 1 that is to be added</param>
    public void Update(byte last, byte next)
    {
      m_counts[last]--;
      m_counts[next]++;

      // The conditions below do not change median value so there is no need to update it
      if (last == next ||
          last < Median && next < Median || // both below median
          last > Median && next > Median) // both above median
        return;

      UpdateMedian();
    }

根据运行中位数进行测试，并计时：

private void TestMedianEstimator()
{
  var r = new Random();
  const int SIZE = 15;
  const byte MAX_VAL = 80;
  var values = new byte[100000];
  for (var i = 0; i < values.Length; i++)
    values[i] = (byte) (MAX_VAL * r.NextDouble());

  var timer = Stopwatch.StartNew();
  // Running median
  var window = new byte[2 * SIZE + 1];
  var medians = new byte[values.Length];
  for (var i = SIZE; i < values.Length - SIZE - 1; i++)
  {
    for (int j = i - SIZE, k = 0; j <= i + SIZE; j++, k++)
      window[k] = values[j];
    Array.Sort(window);
    medians[i] = window[SIZE];
  }
  timer.Stop();
  var elapsed1 = timer.Elapsed;

  timer.Restart();
  var me = new MedianEstimator(2 * SIZE + 1, MAX_VAL);
  me.Init(values.Slice(0, 2 * SIZE + 1));
  var meMedians = new byte[values.Length];
  for (var i = SIZE; i < values.Length - SIZE - 1; i++)
  {
    meMedians[i] = me.Median;
    me.Update(values[i - SIZE], values[i + SIZE + 1]);
  }
  timer.Stop();
  var elapsed2 = timer.Elapsed;

  WriteLineToLog($"{elapsed1.TotalMilliseconds / elapsed2.TotalMilliseconds:0.00}");

  var diff = 0;
  for (var i = 0; i < meMedians.Length; i++)
    diff += Math.Abs(meMedians[i] - medians[i]);
  WriteLineToLog($"Diff: {diff}");
}

Based on @mathog thoughts, this is a C# implementation for a running median over an array of bytes with known range of values. Can be extended to other integer types.

  /// <summary>
  /// Median estimation by histogram, avoids multiple sorting operations for a running median
  /// </summary>
  public class MedianEstimator
  {
    private readonly int m_size2;
    private readonly byte[] m_counts;

    /// <summary>
    /// Estimated median, available right after calling <see cref="Init"/> or <see cref="Update"/>.
    /// </summary>
    public byte Median { get; private set; }

    /// <summary>
    /// Ctor
    /// </summary>
    /// <param name="size">Median size in samples</param>
    /// <param name="maxValue">Maximum expected value in input data</param>
    public MedianEstimator(
      int size,
      byte maxValue)
    {
      m_size2 = size / 2;
      m_counts = new byte[maxValue + 1];
    }

    /// <summary>
    /// Initializes the internal histogram with the passed sample values
    /// </summary>
    /// <param name="values">Array of values, usually the start of the array for a running median</param>
    public void Init(byte[] values)
    {
      for (var i = 0; i < values.Length; i++)
        m_counts[values[i]]++;

      UpdateMedian();
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private void UpdateMedian()
    {
      // The median is the first value up to which counts add to size / 2
      var sum = 0;
      Median = 0;
      for (var i = 0; i < m_counts.Length; i++)
      {
        sum += m_counts[i];
        Median = (byte) i;
        if (sum > m_size2) break;
      }
    }

    /// <summary>
    /// Updates the median estimation by removing <paramref name="last"/> and adding <paramref name="next"/>. These
    /// values must be updated as the running median is applied. If the median length is <i>N</i>, at the sample
    /// <i>i</i>, <paramref name="last"/> is sample at index <i>i</i>-<i>N</i>/2 and <paramref name="next"/> is sample
    /// at index <i>i</i>+<i>N</i>/2+1.
    /// </summary>
    /// <param name="last">Sample at the start of the moving window that is to be removed</param>
    /// <param name="next">Sample at the end of the moving window + 1 that is to be added</param>
    public void Update(byte last, byte next)
    {
      m_counts[last]--;
      m_counts[next]++;

      // The conditions below do not change median value so there is no need to update it
      if (last == next ||
          last < Median && next < Median || // both below median
          last > Median && next > Median) // both above median
        return;

      UpdateMedian();
    }

Testing against a running median, with timing:

private void TestMedianEstimator()
{
  var r = new Random();
  const int SIZE = 15;
  const byte MAX_VAL = 80;
  var values = new byte[100000];
  for (var i = 0; i < values.Length; i++)
    values[i] = (byte) (MAX_VAL * r.NextDouble());

  var timer = Stopwatch.StartNew();
  // Running median
  var window = new byte[2 * SIZE + 1];
  var medians = new byte[values.Length];
  for (var i = SIZE; i < values.Length - SIZE - 1; i++)
  {
    for (int j = i - SIZE, k = 0; j <= i + SIZE; j++, k++)
      window[k] = values[j];
    Array.Sort(window);
    medians[i] = window[SIZE];
  }
  timer.Stop();
  var elapsed1 = timer.Elapsed;

  timer.Restart();
  var me = new MedianEstimator(2 * SIZE + 1, MAX_VAL);
  me.Init(values.Slice(0, 2 * SIZE + 1));
  var meMedians = new byte[values.Length];
  for (var i = SIZE; i < values.Length - SIZE - 1; i++)
  {
    meMedians[i] = me.Median;
    me.Update(values[i - SIZE], values[i + SIZE + 1]);
  }
  timer.Stop();
  var elapsed2 = timer.Elapsed;

  WriteLineToLog(quot;{elapsed1.TotalMilliseconds / elapsed2.TotalMilliseconds:0.00}");

  var diff = 0;
  for (var i = 0; i < meMedians.Length; i++)
    diff += Math.Abs(meMedians[i] - medians[i]);
  WriteLineToLog(quot;Diff: {diff}");
}

回复收藏 0 原文