Finding and fixing wrong values in a simple linear data set

Posted on 2024-11-09 16:00:09

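This may be a simple problem, but I can't find a good way to solve it.

I have a finite number of ordered int values that should be a similar distance apart, for example: 32, 42, 52, 62, 72, 82.

In reality, however, some of the values are wrong. We might end up with 32, 51, 62, 66, 71, 83.

How can I find the obviously wrong value (in this case: 66) and move it to the correct position (42)?

It can be assumed that most of the data is still valid, so a good guess at the correct distance between points (here: 10) can still be calculated.
The number of points is known and correct (i.e. we only need to move points, not add or remove any).
The data boundaries on the left and right are unknown; the behavior in edge cases can be freely defined.

While writing the question, something came to mind. One idea would be to extract a function f(x) = a + x * b (that part is easy) and iterate over the known number of points. The datum with the largest distance to any iterated point is removed and inserted at the iterated position that has the largest distance to any original point.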
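The idea in the last paragraph could be sketched roughly as follows in C. This is only an illustration under stated assumptions: the line f(x) = a + x * b is taken as already estimated (a = 32 is simply the first value of the example, b = 10 is the spacing named in the question), and repair_once, dist_to_grid and dist_to_data are hypothetical names, not part of the question.

```c
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Distance from v to the nearest grid point a + i*b, for i in [0, n). */
static int dist_to_grid(int v, int a, int b, size_t n)
{
    int best = INT_MAX;
    for (size_t i = 0; i < n; i++) {
        int d = abs(v - (a + (int)i * b));
        if (d < best) best = d;
    }
    return best;
}

/* Distance from v to the nearest element of data. */
static int dist_to_data(int v, const int *data, size_t n)
{
    int best = INT_MAX;
    for (size_t i = 0; i < n; i++) {
        int d = abs(v - data[i]);
        if (d < best) best = d;
    }
    return best;
}

/* One repair step (assumed interpretation of the idea): overwrite the
   datum that is farthest from every grid point with the grid value that
   is farthest from every datum. Returns the index that was overwritten.
   The array is not re-sorted here. */
size_t repair_once(int *data, size_t n, int a, int b)
{
    size_t worst = 0, gap = 0;
    int worst_d = -1, gap_d = -1;

    if (n == 0) return 0;

    for (size_t i = 0; i < n; i++) {
        int d = dist_to_grid(data[i], a, b, n);
        if (d > worst_d) { worst_d = d; worst = i; }
    }
    for (size_t i = 0; i < n; i++) {
        int d = dist_to_data(a + (int)i * b, data, n);
        if (d > gap_d) { gap_d = d; gap = i; }
    }
    data[worst] = a + (int)gap * b;
    return worst;
}
```

Called on the example values 32, 51, 62, 66, 71, 83 with a = 32 and b = 10, this overwrites 66 with 42; the array would still need to be sorted afterwards to restore the ordering.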

昔梦 2024-11-16 16:00:09

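You could use robust regression, which is nothing more than "fitting a straight line to a bunch of points in such a way that points that don't fit well are gracefully removed".

If you don't want to write non-linear optimization code, you could use iteratively reweighted least squares and reuse whatever weighted linear regression code you already have.

The idea is that you use weighted least squares to fit a straight line to your points. You then assign each point a weight that measures whether you think it is an outlier, i.e. whether it deviates too much from the regression line (for example via the Huber loss function). You then redo the regression with these weights. You get a new line, so you can compute a new set of weights. Repeat until convergence (or until a maximum number of iterations is reached). You end up with weights that tell you which points are bad, and with a line that fits the remaining points well and can be used to replace the outliers.

I don't think the implementation would be much longer than the description above.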
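As a rough illustration of the iteratively reweighted least squares loop described above, here is a minimal sketch in C. It assumes that each value is paired with a known x position (for instance its index) and uses the Huber weight, w = 1 for |r| <= k and w = k / |r| otherwise. The tuning constant k, the iteration limit and the function names weighted_fit and irls_line_fit are illustrative choices, not something the answer specifies.

```c
#include <math.h>
#include <stddef.h>

/* Weighted least-squares fit of y = a + b*x.
   Returns 0 on success, -1 if the system is degenerate. */
static int weighted_fit(const double *x, const double *y, const double *w,
                        size_t n, double *a, double *b)
{
    double sw = 0, swx = 0, swy = 0, swxx = 0, swxy = 0;
    for (size_t i = 0; i < n; i++) {
        sw   += w[i];
        swx  += w[i] * x[i];
        swy  += w[i] * y[i];
        swxx += w[i] * x[i] * x[i];
        swxy += w[i] * x[i] * y[i];
    }
    double det = sw * swxx - swx * swx;
    if (fabs(det) < 1e-12) return -1;
    *b = (sw * swxy - swx * swy) / det;
    *a = (swy - *b * swx) / sw;
    return 0;
}

/* IRLS with Huber weights. On return, w[i] close to 1 marks a point
   that fits the line and a small w[i] marks a likely outlier. */
int irls_line_fit(const double *x, const double *y, double *w, size_t n,
                  double k, int max_iter, double *a, double *b)
{
    for (size_t i = 0; i < n; i++) w[i] = 1.0;  /* start with plain least squares */
    *a = 0.0;
    *b = 0.0;

    for (int iter = 0; iter < max_iter; iter++) {
        double na, nb;
        if (weighted_fit(x, y, w, n, &na, &nb) != 0) return -1;
        double change = fabs(na - *a) + fabs(nb - *b);
        *a = na;
        *b = nb;

        /* Recompute the Huber weights from the residuals of the new line. */
        for (size_t i = 0; i < n; i++) {
            double r = fabs(y[i] - (*a + *b * x[i]));
            w[i] = (r <= k) ? 1.0 : k / r;
        }
        if (iter > 0 && change < 1e-9) break;   /* line stopped moving */
    }
    return 0;
}
```

With the hypothetical inputs x = 0..5, y = 32, 42, 52, 95, 72, 82 and k = 3, the point at x = 3 ends up with by far the smallest weight, and a + b * 3 could then serve as its replacement. Note that this only detects values that are wrong for their position; it does not by itself decide where a misplaced value should move to.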

百变从容 2024-11-16 16:00:09

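If only one datum is wrong, and assuming increasing values (as in your example), something like the following should work.
The data goes in DATA and DATA_SIZE, and THRESHOLD is the allowed deviation.

#include <stdio.h>
#include <stdlib.h>   /* abs */

#define THRESHOLD 3
#define DATA 32, 51, 62, 66, 71, 83
#define DATA_SIZE 6

int main(void)
{
    int data[] = {DATA};
    int size = DATA_SIZE;
    int skip, item, lastItem, dif, diffs, curDif, maxDif, maxPos, maxStart;
    int maxDiffs = 10000, location = -1, newPosition = -1, newValue = 0;

    /* Leave out each element in turn; the element whose removal leaves
       the fewest changes in spacing is taken to be the wrong one. */
    for (skip = 0; skip < size; skip++)
    {
        diffs = 0;
        curDif = 0;
        maxDif = 0;
        maxPos = -1;
        maxStart = -1;
        lastItem = (skip == 0);   /* start at index 1 when index 0 is skipped */
        for (item = lastItem + 1; item < size; item++)
        {
            if (item == skip) continue;
            dif = data[item] - data[lastItem];
            if (abs(dif - curDif) > THRESHOLD)
            {
                /* The spacing changed: count it and remember the widest gap. */
                curDif = dif;
                diffs++;
                if (curDif > maxDif)
                {
                    maxDif = curDif;
                    maxPos = item;
                    maxStart = lastItem;   /* element where the widest gap begins */
                }
            }
            lastItem = item;
        }

        if (diffs < maxDiffs && maxPos != -1)
        {
            maxDiffs = diffs;
            location = skip;
            newPosition = maxPos;
            /* Put the new value in the middle of the widest gap. */
            newValue = data[maxStart] + (maxDif >> 1);
        }
    }
    if (location < 0)
    {
        printf("No candidate found\n");
        return 1;
    }
    printf("Found... \nindex %d\nValue: %d\nGoes in:%d\nNew value:%d\n", location, data[location], newPosition, newValue);
    return 0;
}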

心不设防 2024-11-16 16:00:09

I experimented with a lot of different approaches; this is what I ended up with. The basic idea is to assign the good, valid values to an array of expected values. Values that cannot be assigned are fixed by using the unmatched expected values instead.

The input is a list of the actual data values, peaks.

Build a list of expected data

var expected = Enumerable
    // 19 is the known number of values
    .Range (0, 19)
    // simply interpolate over the actual data
    .Select (x => peaks.First () + x * (peaks.Last () - peaks.First ()) / 18)
    .ToList ();

Build a matrix of the distances of all points

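// LINQ evaluates this query lazily, so it is re-computed from the current
// contents of expected and peaks every time it is enumerated
var distances = expected.SelectMany (dst => peaks.Select (src => new {
    Expected = dst,
    Original = src,
    Distance = Math.Abs (dst - src)
}));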

Repeat

for (;;)
{

Select the best distance

var best = distances
    // ignore really bad values
    // (dAvgAll is not defined in this snippet; presumably the average
    // distance between the points)
    .Where (x => x.Distance < dAvgAll * 0.3)
    .OrderBy (x => x.Distance).FirstOrDefault ();

If no good assignment was found, quit

if (best == null) {
    break;
}

Else, store the match

expected.Remove (best.Expected);
peaks.Remove (best.Original);

}

All valid entries in our source have been identified and removed. We simply use the left-over values in the expected set and ignore the left-over original values to finish our final data set.

Other attempted approaches, including a version adapted from gusbro's, worked less well and often displayed bad behavior for me.

作业与我同在 2024-11-16 16:00:09

I will try to outline an algorithm (I don't know whether it gives a correct result for every input sequence, so treat it as an idea):

The input for the algorithm is the ordered sequence R, for example { 32, 51, 62, 66, 71, 83 }.

Find the distance d between points. I'm thinking of:
- Sort the differences between adjacent elements and take the median.
  Sorted differences = { 4, 5, 11, 12, 19 } --> Median = 11
- Or calculate the mean value of the differences.
  Mean value = 10.2 --> Rounded mean value = 10
Build the mean value m of the elements of R.
In our example (32 + 51 + 62 + 66 + 71 + 83) / 6 = 60.8
Build a comparison sequence S whose first element S_0 has the value
m - (n / 2) * d (where n is the number of elements) and whose further elements S_i have the value S_0 + i * d.
In our example, with d = 10, S_0 = 60.8 - 3 * 10 = 30.8, taken as 30 here, so S = { 30, 40, 50, 60, 70, 80 }
Because the elements in the input sequence could have moved to another position,
build every permutation of R.
Find the permutation where the number of outliers is minimal (an outlier is an element that differs from the corresponding element of S by more than 0.3 * d).

                     S = { 30, 40, 50, 60, 70, 80 } 
    permutation x of R = { 32, 51, 62, 66, 71, 83 } three outliers
    permutation y of R = { 32, 66, 51, 62, 71, 83 } one outlier
    permutation z of R = ...

The result of the algorithm in this example would be permutation y, and with it the correct position of the element 66 is found.
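The two building blocks of this outline, estimating d as the median gap and counting outliers against the comparison sequence, might look roughly like this in C. It is a sketch under assumptions: median_gap and count_outliers are hypothetical names, the comparison sequence is passed in as its first element s0 and spacing d, and the permutation search itself is left out.

```c
#include <stddef.h>
#include <stdlib.h>

static int cmp_int(const void *p, const void *q)
{
    int a = *(const int *)p, b = *(const int *)q;
    return (a > b) - (a < b);
}

/* Median of the gaps between adjacent elements of r.
   Returns 0 if there are fewer than two elements or allocation fails. */
int median_gap(const int *r, size_t n)
{
    if (n < 2) return 0;
    size_t m = n - 1;
    int *gaps = malloc(m * sizeof *gaps);
    if (gaps == NULL) return 0;

    for (size_t i = 0; i < m; i++) gaps[i] = r[i + 1] - r[i];
    qsort(gaps, m, sizeof *gaps, cmp_int);

    /* Middle element for an odd count, mean of the two middle ones otherwise. */
    int med = (m % 2) ? gaps[m / 2] : (gaps[m / 2 - 1] + gaps[m / 2]) / 2;
    free(gaps);
    return med;
}

/* Number of elements of candidate that differ from the comparison
   sequence s0 + i*d by more than tol. */
size_t count_outliers(const int *candidate, size_t n, int s0, int d, double tol)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        int expected = s0 + (int)i * d;
        if (abs(candidate[i] - expected) > tol) count++;
    }
    return count;
}
```

For the example, median_gap gives 11, and with s0 = 30, d = 10 and tol = 3 the function counts three outliers for permutation x and one for permutation y, matching the listing above. Enumerating every permutation grows factorially with n, so a full search is only practical for short sequences.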
