How to filter out useless data in a dataset using Python?
I have a dataset: temperature and pressure values in different ranges.
I want to filter out all data that deviates more than x% from the "normal" value. This data occurs on process failures.
Extra: the normal value can change over a longer time, so what is an exception at timestamp1 can be normal at timestamp2.
I looked into some noise filters, but I'm not sure this is noise.
Comments (1)
You asked two questions.
1.
Tack on a derived column, so it's easy to filter.
For "x%", like five percent, you might flag every row whose value differs from a scalar average (`avg`) by more than that fraction.
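The original snippet for this step is not available here, so the following is only a minimal sketch of what such a derived column could look like, assuming pandas. The column names `temperature`, `rel_dev` and `is_outlier`, the sample values, and the choice of the column mean for `avg` are all illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample data: mostly "normal" readings plus one process failure.
df = pd.DataFrame(
    {"temperature": [20.0, 20.2, 19.9, 20.1, 19.8, 20.0,
                     20.3, 19.7, 20.1, 26.0, 20.0, 19.9]}
)

x = 0.05  # allowed relative deviation (five percent)

# Assumption: the "normal" value is taken to be the column mean.
avg = df["temperature"].mean()

# Derived columns: relative deviation from avg, and a boolean flag.
df["rel_dev"] = (df["temperature"] - avg).abs() / abs(avg)
df["is_outlier"] = df["rel_dev"] > x

# Keep only the rows within x of the normal value.
clean = df[~df["is_outlier"]]
```

Keeping the flag as a column, rather than dropping rows immediately, makes it easy to inspect what was rejected before committing to a threshold.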
But rather than working with a percentage, you might find it more natural to work with standard deviations, filtering out e.g. values more than three standard deviations from the mean.
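A rough sketch of the standard-deviation variant, assuming pandas and NumPy. The synthetic data, the injected fault positions, and the column names `z` and `is_outlier` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 readings around 20.0, with two injected faults.
rng = np.random.default_rng(0)
values = rng.normal(loc=20.0, scale=0.2, size=200)
values[[50, 120]] = 30.0
df = pd.DataFrame({"temperature": values})

avg = df["temperature"].mean()
std = df["temperature"].std()  # sample standard deviation (ddof=1)

# z-score: how many standard deviations each value lies from the mean.
df["z"] = (df["temperature"] - avg) / std
df["is_outlier"] = df["z"].abs() > 3

clean = df[~df["is_outlier"]]
```

One caveat with this approach: the outliers themselves inflate both the mean and the standard deviation, so with very few samples a single extreme value may not reach three standard deviations.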
2.
(Extra: the normal value can change over time.)
You could use a window of "most recent 100 samples" to define a smoothed average, store that as an extra column, and it replaces the `avg` scalar in the calculations above. More generally, you could manually set high / low thresholds as a time series in your data.
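The windowed idea could be sketched like this, assuming pandas. The drifting synthetic signal, the `min_periods` value, the one-sample shift (so a reading is compared only against earlier readings), and the column names are illustrative choices rather than anything prescribed above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "normal" level drifts slowly from 20 to 22,
# with a single process failure injected at index 250.
rng = np.random.default_rng(0)
n = 400
values = np.linspace(20.0, 22.0, n) + rng.normal(0.0, 0.05, n)
values[250] = 25.0
df = pd.DataFrame({"temperature": values})

x = 0.05  # allowed relative deviation from the local normal value

# Smoothed average over the most recent 100 samples, stored as a column.
# shift(1) excludes the current reading from its own reference value.
df["rolling_avg"] = (
    df["temperature"].rolling(window=100, min_periods=20).mean().shift(1)
)

# Same calculation as before, with the column in place of the avg scalar.
df["rel_dev"] = (df["temperature"] - df["rolling_avg"]).abs() / df["rolling_avg"].abs()

# Rows without a reference value yet (NaN) compare as False, so they are kept.
df["is_outlier"] = df["rel_dev"] > x

clean = df[~df["is_outlier"]]
```

A trailing mean lags behind the drift, so the window length has to be short enough to follow the changing normal value yet long enough that a failure does not drag the average along with it.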
The area you're describing is called "change point detection", and there is an extensive literature on it, see e.g. https://paperswithcode.com/task/change-point-detection .
I have used ruptures to good effect, and I recommend it to you.