Finding and fixing wrong values in a simple linear data set
This is probably a simple question yet I could not find a good approach.
I've got a limited number of ordered int values that are supposed to be of similar distance to each other, e.g.: 32, 42, 52, 62, 72, 82.
In reality though, some values are wrong. We might end up with 32, 51, 62, 66, 71, 83.
How can I find the obviously wrong value (in this case: 66) and move it to the correct position (42)?
- It can be assumed that most data are still valid so it is still possible to calculate a good guess of the correct distance between points (here: 10).
- The number of points is known and correct (i.e., we just need to move but not add or remove points).
- The data boundaries to the left and to the right are unknown, behavior in edge cases can be defined freely.
While writing the question I thought of something. An idea might be to extract a function f(x) = a + x * b (that's easy) and iterate over the known number of points. The datum with the largest distance to an iterated point is removed and inserted at the iterated position which has the largest distance to an original point.
You can use robust regression, which is nothing more than a fancy term for "fitting a straight line to a bunch of points in such a way that points that don't fit well are gracefully removed".
If you don't want to write the non-linear optimization code, you can use iteratively reweighted least squares to leverage any existing weighted linear regression code you have lying around.
The idea is that you do weighted least squares to fit a straight line to your points. You then assign a weight to each point that measures whether you think it's an outlier, i.e. whether it deviates from the regression line too much (e.g. via the Huber loss function). You then redo the regression with the weights. You'll get a new line and therefore can compute a new set of weights. Repeat until convergence (or a max number of iterations). You'll be left with weights that tell you which points are bad, and a line that nicely fits the remaining points and which can be used to replace the outliers.
I think the implementation isn't vastly longer than the text description above.
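A minimal sketch of the iteratively reweighted least squares idea, in plain Python. The function names, the Huber threshold `delta`, and the weight `cutoff` used to flag outliers are assumptions chosen for illustration, not part of the answer. The sketch also assumes each value's index is its x coordinate, which fits a value corrupted in place better than a value that has shifted its neighbours' positions.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least-squares fit of y = a + b * x (closed form)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def huber_weight(residual, delta):
    """Weight 1 inside the threshold, shrinking as delta / |r| outside it."""
    r = abs(residual)
    return 1.0 if r <= delta else delta / r


def irls_line(ys, delta=3.0, max_iter=50, tol=1e-9):
    """Fit a line to (index, value) pairs, down-weighting poorly fitting points."""
    xs = list(range(len(ys)))
    ws = [1.0] * len(ys)
    a, b = weighted_line_fit(xs, ys, ws)  # start from ordinary least squares
    for _ in range(max_iter):
        ws = [huber_weight(y - (a + b * x), delta) for x, y in zip(xs, ys)]
        new_a, new_b = weighted_line_fit(xs, ys, ws)
        converged = abs(new_a - a) < tol and abs(new_b - b) < tol
        a, b = new_a, new_b
        if converged:
            break
    return a, b, ws


def repair(ys, a, b, ws, cutoff=0.5):
    """Replace points whose final weight is below cutoff by the line's value."""
    return [round(a + b * x) if w < cutoff else y
            for x, (y, w) in enumerate(zip(ys, ws))]


data = [32, 42, 52, 95, 72, 82]  # hypothetical input: index 3 corrupted in place
a, b, ws = irls_line(data)
print(repair(data, a, b, ws))
```

Because the Huber weight never reaches zero, the outlier still pulls the line slightly, so the replaced value may differ from the ideal one by a small amount.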
If only one datum is wrong, and assuming increasing values (as in your example):
The data goes in DATA and DATA_SIZE, and THRESHOLD is the deviation allowed
I experimented with a lot of different approaches, and this is what I ended up with. The basic idea is to assign good, valid values to the array of expected values. Values that could not be assigned get fixed by using the missing expected values instead.
- Given is a list of actual data, peaks.
- Build a list of expected data.
- Build a matrix of the distances of all points.
- Repeat:
  - Select the best distance.
  - If no good assignment was found, quit.
  - Else, store the match.
All valid entries in our source have been identified and removed. We simply use the left-over values in the expected set and ignore the left-over original values to finish our final data set.
Other attempted approaches, including a version adapted from gusbro's, worked less well and often displayed bad behavior for me.
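The steps above could be sketched roughly as follows. This is an illustration under assumptions, not the code actually used: the name `repair_by_matching`, the `max_dist` cutoff that decides what counts as a "good" assignment, and the way the expected list is supplied by the caller are all hypothetical.

```python
def repair_by_matching(actual, expected, max_dist):
    """Greedily match actual values to expected slots, closest pairs first.

    Slots that receive no match keep their expected value; actual values
    that are never matched are dropped.
    """
    # The "distance matrix", flattened to (distance, actual idx, expected idx)
    # and sorted so the best remaining distance always comes first.
    pairs = sorted(
        (abs(a - e), i, j)
        for i, a in enumerate(actual)
        for j, e in enumerate(expected)
    )
    result = list(expected)
    used_actual, used_expected = set(), set()
    for dist, i, j in pairs:
        if dist > max_dist:
            break  # no good assignment left, quit
        if i in used_actual or j in used_expected:
            continue  # one side of this pair is already matched
        result[j] = actual[i]  # store the match
        used_actual.add(i)
        used_expected.add(j)
    return result


peaks = [32, 51, 62, 66, 71, 83]
expected = [32 + 10 * i for i in range(len(peaks))]  # assumed start and step
print(repair_by_matching(peaks, expected, max_dist=3))
```

With these assumed inputs, 66 finds no expected slot within the cutoff, so the unmatched slot at 42 keeps its expected value.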
I will try to outline an algorithm (I don't know if it would give a correct result for every input sequence, therefore think of it as an idea):
- Input for the algorithm is the ordered sequence R. For example { 32, 51, 62, 66, 71, 83 }.
- Find the distance d between points. I'm thinking of:
  - Sorted differences = { 4, 5, 11, 12, 19 } --> Median = 11
  - Mean value = 10.2 --> Rounded mean value = 10
- Build the mean value m of the elements of R. In our example (32 + 51 + 62 + 66 + 71 + 83) / 6 = 60.8, rounded = 61.
- Build a comparative sequence S where the first element S_0 has the value m - (n / 2) * d (where n is the number of elements) and any further element S_i has the value S_0 + i * d. In our example S = { 31, 41, 51, 61, 71, 81 }.
- Because the elements in the input sequence could have moved to another position, build every permutation of R.
- Find the permutation where the number of outliers is minimal (an outlier is an element whose difference from the corresponding element of S is greater than 0.3 * d).
The result of the algorithm in this example would be the permutation that places 66 in the second position, and with it the correct position of the element 66 is found.
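As a rough sketch of this outline, assuming the rounded mean difference is used for d and that ties between permutations are broken by enumeration order (the function name `best_permutation` and the `tolerance` parameter are made up for illustration):

```python
from itertools import permutations
from statistics import mean


def best_permutation(r, tolerance=0.3):
    """Return (permutation of r with the fewest outliers, comparison sequence S)."""
    n = len(r)
    d = round(mean(b - a for a, b in zip(r, r[1:])))  # rounded mean distance
    m = round(mean(r))                                # rounded mean of R
    s = [m - (n // 2) * d + i * d for i in range(n)]  # S_i = S_0 + i * d
    limit = tolerance * d

    def outlier_count(p):
        # An element is an outlier if it is further than limit from S_i.
        return sum(abs(x - e) > limit for x, e in zip(p, s))

    return min(permutations(r), key=outlier_count), s


perm, s = best_permutation([32, 51, 62, 66, 71, 83])
print(s)
print(perm)
```

For the sample input the code computes m = 61 (the mean is about 60.8) and therefore S = { 31, 41, 51, 61, 71, 81 }. Trying every permutation grows factorially with the number of points, so this is only practical for short sequences.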