How can I filter useless data out of a dataset using Python?

Posted 2025-01-20 15:22:50, 155 words, 1 view, 0 comments

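I have a dataset of temperature and pressure values covering different ranges. I want to filter out all data that deviates by x% from the "normal" value. Such data occurs when the process malfunctions.

Extra: the normal value can change over longer periods, so a value that is an exception at Timestamp1 may be normal at Timestamp2.

I have looked into some noise filters, but I am not sure this is noise.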

写下不归期 2025-01-27 15:22:50

You asked two questions.

1.

Tack on a derived column, so it's easy to filter.

For "x%", like five percent, you might use

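import numpy as np

avg = np.mean(df.pressure)  # global mean of the pressure column
df['pres_deviation'] = abs(df.pressure - avg) / abs(avg)  # relative deviation from the mean
print(df[df.pres_deviation < .05])  # keep only rows within 5% of the mean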

But rather than working with a percentage,
you might find it more natural to work with standard deviations,
filtering out e.g. values more than three standard deviations from the mean.
See
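A minimal sketch of the standard-deviation variant, assuming a pandas DataFrame with a pressure column. The sample data, the seed, the injected fault and the pres_zscore column name are all made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 readings around 100 with one injected fault.
rng = np.random.default_rng(0)
pressure = rng.normal(100.0, 1.0, 200)
pressure[50] = 130.0  # simulated process malfunction
df = pd.DataFrame({"pressure": pressure})

mean = df["pressure"].mean()
std = df["pressure"].std()  # sample standard deviation (ddof=1)

# Distance from the mean, measured in standard deviations.
df["pres_zscore"] = (df["pressure"] - mean) / std

# Keep only rows within three standard deviations of the mean.
filtered = df[df["pres_zscore"].abs() <= 3]
print(len(df), len(filtered))
```

Note that large outliers inflate the mean and the standard deviation themselves, so on small datasets a fault can hide behind the statistics it distorts.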

2.

(Extra: the normal value can change over time.)

You could use a window of the most recent 100 samples to define a smoothed average, store it as an extra column, and use that column in place of the scalar avg in the calculation above.
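As a rough sketch of the rolling-window idea, assuming pandas and a trailing window of 100 samples. The drifting data, the injected fault and the rolling_avg column name are invented for the example:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the "normal" level drifts slowly from 100 to 110.
rng = np.random.default_rng(0)
n = 400
pressure = np.linspace(100.0, 110.0, n) + rng.normal(0.0, 0.5, n)
pressure[250] = 140.0  # simulated process malfunction
df = pd.DataFrame({"pressure": pressure})

# Trailing mean over the most recent 100 samples (fewer at the start).
df["rolling_avg"] = df["pressure"].rolling(window=100, min_periods=1).mean()

# Same relative deviation as before, but against the moving baseline.
df["pres_deviation"] = (df["pressure"] - df["rolling_avg"]).abs() / df["rolling_avg"].abs()

filtered = df[df["pres_deviation"] < 0.05]
print(len(df), len(filtered))
```

A rolling median (`.rolling(...).median()`) may be worth trying instead of the mean, since the faulty samples themselves pull a mean toward them.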

More generally you could manually set high / low thresholds as a time series in your data.
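One way this could look, assuming the thresholds live in two extra columns next to the measurements. The timestamps, values and the low / high column names are hypothetical:

```python
import pandas as pd

# Hypothetical data: the process runs near 100 at first, then near 120.
df = pd.DataFrame({
    "timestamp": pd.date_range("2025-01-01", periods=6, freq="60min"),
    "pressure": [100.0, 101.0, 130.0, 121.0, 119.0, 100.0],
})

# Hand-set thresholds, one pair per row, so they can change over time.
df["low"] = [95.0, 95.0, 95.0, 115.0, 115.0, 115.0]
df["high"] = [105.0, 105.0, 105.0, 125.0, 125.0, 125.0]

# Row-wise check: low <= pressure <= high.
in_range = df["pressure"].between(df["low"], df["high"])
filtered = df[in_range]
print(filtered)
```

Here a pressure of 100 is accepted in the first rows but rejected in the last one, which matches the case where a value that is normal at one timestamp is an exception at another.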

The area you're describing is called "change point detection", and there is an extensive literature on it, see e.g. https://paperswithcode.com/task/change-point-detection .
I have used the ruptures library to good effect, and I recommend it to you.
