How should I filter this data?

Posted on 2024-09-24 13:15:10

I have several series of data points that need to be graphed. For each graph, some points may need to be thrown out due to error. An example is the following:

[image: example graph with the erroneous regions circled]

The circled areas are errors in the data.

What I need is an algorithm to filter this data so that it eliminates the error by replacing the bad points with flat lines, like so:

[image: the same graph with the bad points replaced by flat lines]

Are there any algorithms out there that are especially good at detecting error points? Do you have any tips that could point me in the right direction?

EDIT: Error points are any points that don't look consistent with the data on both sides. Large jumps are acceptable as long as the data after the jump still looks consistent. A large jump at the edge of the graph should probably be considered an error.

Comments (2)

昨迟人 2024-10-01 13:15:10

This is a problem that is hard to solve generically; your final solution will end up being very process-dependent, and unique to your situation.

That being said, you need to start by understanding your data: from one sample to the next, what kind of variation is possible? Using that, you can use previous data samples (and maybe future data samples) to decide if the current sample is bogus or not. Then, you'll end up with a filter that looks something like:

    const int MaxQueueLength = 100;           // adjust these two values as necessary
    const double MaxProjectionError = 5;

    List<double> FilterData(List<double> rawData)
    {
        List<double> toRet = new List<double>(rawData.Count);
        Queue<double> history = new Queue<double>(MaxQueueLength); // most recent accepted samples, oldest first
        foreach (double rawSample in rawData)
        {
            // Project the expected value, then keep the raw sample only if it is close enough to it.
            double projectedSample = GuessNext(history, rawSample);
            double currentSample = (Math.Abs(projectedSample - rawSample) > MaxProjectionError) ? projectedSample : rawSample;
            toRet.Add(currentSample);
            history.Enqueue(currentSample);
            // Trim after enqueueing so the history never holds more than MaxQueueLength samples.
            while (history.Count > MaxQueueLength)
                history.Dequeue();
        }
        return toRet;
    }

The magic, then, is coming up with your GuessNext function. Here, you'll be getting into stuff that is specific to your situation, and should take into account everything you know about the process that is gathering data. Are there physical limits to how quickly the input can change? Does your data have known bad values you can easily filter?
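As a rough illustration of those two questions, the sketch below rejects a sample if it equals a known bad value or if it changes faster than the process physically could, and replaces it with the last accepted value (the flat line asked for in the question). `PlausibilityChecks`, `KnownBadValue`, `MaxStepPerSample`, and `HoldLastGood` are hypothetical names with made-up values, not part of the filter above.

```csharp
using System;
using System.Collections.Generic;

public static class PlausibilityChecks
{
    // Hypothetical values, for illustration only.
    public const double KnownBadValue = -9999.0;  // assumed sentinel a device might report on failure
    public const double MaxStepPerSample = 5.0;   // assumed physical limit on change between samples

    // Replaces implausible samples with the last accepted value (a flat line).
    public static List<double> HoldLastGood(IReadOnlyList<double> raw)
    {
        var result = new List<double>(raw.Count);
        double lastGood = double.NaN;             // NaN means "no accepted sample yet"
        foreach (double sample in raw)
        {
            bool isSentinel = double.IsNaN(sample) || sample == KnownBadValue;
            bool haveGood = !double.IsNaN(lastGood);
            bool tooFast = haveGood && Math.Abs(sample - lastGood) > MaxStepPerSample;
            if (!isSentinel && !tooFast)
                lastGood = sample;
            result.Add(lastGood);                 // stays NaN until the first plausible sample arrives
        }
        return result;
    }
}
```

With these assumed limits, the input `1, 2, 50, 3, -9999, 4` comes out as `1, 2, 2, 3, 3, 4`. One caveat of this design: a genuine jump larger than `MaxStepPerSample` is held flat indefinitely, so it only suits data where such jumps cannot really happen.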

Here is a simple example of a GuessNext function that works off the first derivative of your data (i.e. it assumes that your data is roughly a straight line when you look at only a small section of it):

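    // Projects the next value by linear extrapolation from the two most recent accepted samples.
    double GuessNext(Queue<double> history, double nextSample)
    {
        // With fewer than two accepted samples there is no slope to extrapolate, so accept the sample as-is.
        if (history.Count < 2)
            return nextSample;
        double[] accepted = history.ToArray();   // oldest first
        double last = accepted[accepted.Length - 1];
        double previous = accepted[accepted.Length - 2];
        // Simple first derivative: assume the input approximates a straight line over a short span.
        // Note: a genuine jump larger than MaxProjectionError will also be rejected by the caller.
        return last + (last - previous);
    }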

If your data is particularly noisy, you may want to apply a smoothing filter to it before you pass it to GuessNext. You'll just have to spend some time with the algorithm to come up with something that makes sense for your data.
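The answer does not name a particular smoothing filter. One common choice, swapped in here purely as an example, is a running median: it suppresses isolated spikes while leaving a sustained jump in place. The sketch below is a minimal version; `Smoothing`, `RunningMedian`, and `halfWidth` are illustrative names, and the window size is something you would tune for your data.

```csharp
using System;
using System.Collections.Generic;

public static class Smoothing
{
    // Replaces each sample with the median of a centered window of up to
    // 2 * halfWidth + 1 samples; the window is truncated at the edges.
    public static List<double> RunningMedian(IReadOnlyList<double> data, int halfWidth)
    {
        if (halfWidth < 0)
            throw new ArgumentOutOfRangeException(nameof(halfWidth));
        var result = new List<double>(data.Count);
        for (int i = 0; i < data.Count; i++)
        {
            int start = Math.Max(0, i - halfWidth);
            int end = Math.Min(data.Count - 1, i + halfWidth);
            var window = new List<double>(end - start + 1);
            for (int j = start; j <= end; j++)
                window.Add(data[j]);
            window.Sort();
            int mid = window.Count / 2;
            // Even-sized windows (possible only at the edges) average the two middle values.
            double median = (window.Count % 2 == 1)
                ? window[mid]
                : (window[mid - 1] + window[mid]) / 2.0;
            result.Add(median);
        }
        return result;
    }
}
```

For example, `Smoothing.RunningMedian(new List<double> { 1, 2, 100, 3, 4 }, 1)` yields `1.5, 2, 3, 4, 3.5`: the spike at 100 disappears, at the cost of slightly altering the neighbouring and edge samples.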

Your example data appears to be parametric, in that each sample defines both an X and a Y value. You might be able to apply the above logic to each dimension independently, which would be appropriate if only one dimension is giving you bad numbers. This can be particularly successful when one dimension is a timestamp, for instance, and the timestamp is occasionally bogus.
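A minimal sketch of the per-dimension idea, assuming the samples are held as (X, Y) tuples: split them into two series, run a filter over each, and recombine. `PerDimension` and `FilterEachAxis` are hypothetical names; the filters are passed in as delegates so any function that takes and returns a `List<double>` can be used.

```csharp
using System;
using System.Collections.Generic;

public static class PerDimension
{
    // Splits (X, Y) samples into two series, filters each with the supplied
    // function, and zips the results back together.
    public static List<(double X, double Y)> FilterEachAxis(
        IReadOnlyList<(double X, double Y)> samples,
        Func<List<double>, List<double>> filterX,
        Func<List<double>, List<double>> filterY)
    {
        var xs = new List<double>(samples.Count);
        var ys = new List<double>(samples.Count);
        foreach (var (x, y) in samples)
        {
            xs.Add(x);
            ys.Add(y);
        }
        List<double> fx = filterX(xs);
        List<double> fy = filterY(ys);
        if (fx.Count != samples.Count || fy.Count != samples.Count)
            throw new InvalidOperationException("Filters must return one value per input sample.");
        var result = new List<(double X, double Y)>(samples.Count);
        for (int i = 0; i < samples.Count; i++)
            result.Add((fx[i], fy[i]));
        return result;
    }
}
```

If only the timestamps are suspect, you could pass a filter such as FilterData for that axis and the identity function `v => v` for the other, so the trusted dimension passes through unchanged.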

夏末 2024-10-01 13:15:10

If removing the outliers by eye is not possible, try kriging (with error terms) as in http://www.ipf.tuwien.ac.at/cb/publications/pipeline.pdf. This seems to work quite well for automatically dealing with occasional extreme noise. I know that French meteorologists use such an approach to remove outliers from their data (for instance, a fire next to a temperature sensor, or something kicking a wind sensor).

Please note that this is a difficult problem in general. Any information about the errors is precious. Did someone kick the measuring device? Then you cannot do much except remove the offending data by hand. Is your noise systematic? Then you can do a lot by making (reasonable) hypotheses about it.
