Identifying positional outliers of a moving object in a NumPy array

Published 2025-01-28 07:01:53


I have a data set of positions (e.g. the x- or y-position of a movable object).
The object moves over time, let's say linearly. The distance between consecutive positions is within a certain range (e.g. 1 +/- 2.0 std).
Now, due to data artifacts, jumps may occur: for example, due to overflow, some positions may jump to a completely different value which is clearly out of the ordinary.

I would like to identify the elements in my positions array that are affected by these artifacts.

Consider the following positions which grow linearly with some noise:

import numpy as np

linear_movement = np.arange(0, 100, 1)
noise = np.random.normal(loc = 0.0, scale = 2.0, size = linear_movement.size)
positions = linear_movement + noise
positions[78] = positions[78]+385

Here position 78 is affected by an artifact.

Since 'positions' is not distributed around a fixed value, and the data can vary over the course of the movement so that today's outlier positions are reached legitimately later on (e.g. if the object moved from 0 to 1000 according to np.arange(0, 1000, 1)), I can't simply filter out positions based on a median plus some offset (as e.g. here: https://stackoverflow.com/a/16562028 ).

I would rather take a look at the mutual distance between consecutive positions to use for the identification of outliers:

distance = np.diff(positions)

First problem (which I could code around in a dirty way, I suppose, if there were only single outliers):

In the distance array, 1 outlier in the original positions array produces 2 outliers.

Moreover, when there are e.g. 4 consecutive outliers, the distance array in between those positions will be claiming everything is normal:

import numpy as np
import matplotlib.pyplot as plt

linear_movement = np.arange(0, 100, 1)
noise = np.random.normal(loc = 0.0, scale = 2.0, size = linear_movement.size)
positions = linear_movement + noise
positions[78:82] = positions[78:82] + 385

# draw
plt.figure()
plt.plot(positions)

distance = np.diff(positions)
distance.astype(int)

[Figure: positions over time]

Output:

Out[264]: 
array([   0,    1,    1,    1,    2,   -3,    3,   -2,    5,    0,    0,
          0,    1,    1,   -1,    3,   -3,    4,    1,    1,    0,    0,
          1,    1,   -1,    4,   -4,    1,    1,    4,    0,    2,    0,
          0,    1,    1,    2,    0,    0,    0,    3,   -3,    3,    2,
          0,    0,    0,    2,    2,    1,   -3,    5,    0,    3,   -1,
          0,    2,   -2,    2,    3,    1,   -3,    0,    4,    0,    6,
          0,   -3,    2,    3,   -3,    3,   -1,    1,    4,   -1,    3,
        382,    0,    2,   -3, -377,    0,    0,    3,    0,    2,    0,
          0,    1,   -2,    3,    0,    0,    2,    2,    5,   -4,    4])

Things I have noted:

  • After every second "big number" in the distance array, things in the
    positions array return to "normal" (apart from the special cases of the
    positions array starting or ending with outliers).
  • When there are multiple consecutive outliers, the distances between the
    outliers themselves are inconspicuous, which makes identifying them
    harder.

Is there a smart way, or even a pre-built function, that would take care of something like this?
In my experience I often make the problem much more complicated than it really is ...

I could think of noting down the indices of the big numbers, taking every second element (and every second + 1) of those indices, and slicing the positions array accordingly... but that seems messy and would again need special cases for starting and ending with outliers.
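For what it's worth, that index-pairing idea can be sketched like this (a minimal sketch on seeded example data; the jump threshold of 20 is my own assumption, and the start/end-with-outliers special cases are still unhandled):

```python
import numpy as np

rng = np.random.default_rng(0)
positions = np.arange(100) + rng.normal(0.0, 2.0, 100)
positions[78:82] += 385  # four consecutive artifact positions

# indices where the step between consecutive positions is suspiciously large
jumps = np.flatnonzero(np.abs(np.diff(positions)) > 20)

# pair the jumps up: everything between jump 2k and jump 2k+1 is an
# outlier segment (an odd number of jumps would mean the array starts
# or ends inside a segment -- not handled here)
outlier = np.zeros(positions.size, dtype=bool)
for start, stop in zip(jumps[::2], jumps[1::2]):
    outlier[start + 1:stop + 1] = True

print(np.flatnonzero(outlier))  # -> [78 79 80 81]
```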

Best

我纯我任性 2025-02-04 07:01:53


It depends on the complexity of your data... You could look at the field of Single Particle Tracking: a lot of algorithms have been developed there to follow particle trajectories and detect outliers.

If your errors typically look like your example, then in this simple case you could use numpy's polyfit to estimate a linear trajectory. But a least-squares fit will tend to over-adjust to the outliers, so I propose using scipy.optimize.minimize on an absolute (L1) norm instead (absolute values rather than squared distances). Taking the difference between the fit and your data really highlights the outliers; you can then use a threshold to separate them (an Otsu threshold, maybe?). The final histogram makes it clear that two groups exist.

import numpy as np
import scipy.optimize
import matplotlib.pyplot as plt

# `positions` comes from the question above
distance = np.diff(positions).astype(int)

# L1 objective: sum of absolute residuals of the line a*time + b
def to_minimize(parameter, time, position):
    return np.sum(np.abs(time * parameter[0] + parameter[1] - position))

# fit of data: least-squares fit as starting point, then the L1 fit
time = np.arange(len(positions))
p0 = np.polyfit(time, positions, 1)
pfit2 = scipy.optimize.minimize(to_minimize, p0, args=(time, positions)).x
diff2 = np.abs(positions - np.polyval(pfit2, time))

# draw
plt.figure(figsize=(16, 3.5))
plt.subplot(131)
plt.title('data')
plt.plot(positions, label='positions')
plt.plot(np.polyval(pfit2, time), label='fit')
plt.legend()
plt.subplot(132)
plt.title('find outliers...')
plt.plot(distance, label='distances')
plt.plot(diff2, label='fit error')
plt.legend()
plt.subplot(133)
plt.hist(diff2, bins=200)
plt.title('histogram of difference')

[Figure: result of the fit code and the differences]

妞丶爷亲个 2025-02-04 07:01:53


I don't pretend to give a definitive answer on this topic; I just wanted to share some thoughts. I was interested in superimposing a regression line on this graph, taking the difference between the values and the regression line, and computing the deviations as percentages.
It turned out that, because of the steep rise on the right, the regression line is adjusted so much that it starts in negative values. Because of this, the percentage difference on the left is greater than on the right.
One could probably take the deviations from right to left instead, but over no more than 50% of all values.
And by the way, I like the orange fit line proposed by Adrien Maurice more.

from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

linear_movement = np.arange(0, 100, 1)
noise = np.random.normal(loc = 0.0, scale = 2.0, size = linear_movement.size)
positions = linear_movement + noise
positions[78:82] = positions[78:82] + 385

ind = np.arange(len(positions)).reshape((-1, 1))
model = LinearRegression()
model.fit(ind, positions)
reg = model.predict(ind)
# deviation from the regression line, as a percentage of the fitted value
delta = np.abs(reg - positions) / np.abs(reg / 100)

fig, ax = plt.subplots(2)
ax[0].plot(ind, positions)
ax[0].plot(ind, reg)
ax[1].plot(ind, delta)

fig.autofmt_xdate()
plt.show()

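A related option, sketched here as my own assumption rather than anything from the answers: instead of percentage deviations, which blow up wherever the fitted line is close to zero, scale the raw residuals by their median absolute deviation (MAD), which the few huge artifact residuals barely shift:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
positions = np.arange(100) + rng.normal(0.0, 2.0, 100)
positions[78:82] += 385

ind = np.arange(len(positions)).reshape(-1, 1)
reg = LinearRegression().fit(ind, positions).predict(ind)
residual = positions - reg

# robust z-score: residuals scaled by the MAD (the 0.6745 factor makes
# it comparable to a standard deviation for Gaussian noise)
med = np.median(residual)
mad = np.median(np.abs(residual - med))
z = 0.6745 * np.abs(residual - med) / mad
print(np.flatnonzero(z > 5))  # the four artifact indices
```

Even though ordinary least squares over-adjusts to the jump, the artifact residuals remain far larger than the inlier residuals, so a robust z-score cutoff of 5 separates them cleanly here.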
