Correlation between timestamps

Posted on 2025-02-04 00:34:32


I hope this isn't off topic; I am not really sure which forum to use for a question like this:

I have a series of data points spanning about an hour from a sensor that retrieves data 20 times per second. Along with it I receive timestamps of a periodic event in this data in the format %Y-%m-%d %H:%M:%S.%f, which looks e.g. like this: 2019-05-23 17:50:34.346000.

I have now created a method to calculate these periodic events myself and was wondering how I could evaluate my method's accuracy. My calculations are sometimes bigger and sometimes smaller by a few milliseconds compared to the actual timestamps. But when I run my own calculated timestamps against the actual timestamps using Python's scipy.stats.pearsonr(x, y) method, I always receive a correlation of nearly 1. I assume that's because these small differences, on the order of milliseconds, don't seem relevant in an hour of data. But how could I evaluate the accuracy of two timestamps in a reasonable way? Are there better metrics to use than correlation?
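For illustration, here is a minimal, self-contained sketch of the effect described above. The event times and the ~5 ms jitter are made up, not the real sensor data:

    # Minimal sketch (made-up numbers, not the real sensor data) of why
    # Pearson correlation stays near 1 even with millisecond-level errors.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)

    # Hypothetical "true" event times: one event per second over an hour,
    # expressed as seconds since the start of the recording.
    y_true = np.arange(0.0, 3600.0, 1.0)

    # Hypothetical predictions: the same events, off by a few milliseconds.
    y_pred = y_true + rng.normal(0.0, 0.005, size=y_true.shape)  # ~5 ms jitter

    r, _ = pearsonr(y_true, y_pred)
    print(r)  # ~0.9999999..., although every prediction is a few ms off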


Comments (1)

月亮是我掰弯的 2025-02-11 00:34:32


It seems that you are trying to compute a linear statistical correlation (Pearson) for something that is, by nature, time-series data. This will not tell you much, and drawing a conclusion based on the results is dangerous.

It so happens that your two vectors x and y grow linearly in the same direction, which is not surprising given that they are timestamps.

Let's take an example for each of time-series data and stationary data:

Time series data:

Your sensor starts giving measurements at time t1 and continues to do so until time t2 is reached. You compute the periodic event's timestamps using your own method, then compare them to the actual timestamps. However, there is no reliable way, using linear statistical correlation, to see whether the two are related and how strongly related they are.

Stationary data:

Now consider the same sensor giving measurements, but instead of computing your periodic events all at once, take a single event and compute it multiple times from your empirical data using different measurements (so forget about any notion of time at this point, i.e. repeat the measurement multiple times). The results can be averaged and an error on the mean can be computed (see the standard error of the mean). This can now be compared to your single event. Based on the error, you get a feel for how good or bad your method is.
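A rough sketch of that repeated-measurement idea, assuming the offsets below are invented estimates of one event's timing error and using scipy.stats.sem for the standard error:

    # Rough sketch: estimate the *same* event several times, then report
    # the mean offset and its standard error. Offsets are invented.
    import numpy as np
    from scipy import stats

    # Hypothetical repeated estimates of one event's timing error, in
    # seconds, relative to the known ground-truth time of that event.
    offsets = np.array([0.004, -0.002, 0.006, 0.001, -0.003, 0.005])

    mean_offset = offsets.mean()
    sem = stats.sem(offsets)  # standard error of the mean

    print(f"mean offset: {mean_offset * 1000:.2f} ms +/- {sem * 1000:.2f} ms (SEM)")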

I would recommend the following:

  • You have your ground-truth answer (say, the periodic event) y_truth. You compute a vector of the periodic events based on your sensor and your own method, mapped as a function f(sensor_input) = y_measured.

  • Now you have two vectors, one measured and one that is ground truth. In each of those vectors you have an indicator of the periodic event, such as an id. I would repeat the whole set of measurements, on all ids, tens of times.

  • For each 'id' I would compute whatever measurement you are looking for (either a timestamp, or a time in seconds, or whatever...), then I would subtract the two timestamps: |y_truth - y_measured|. These are called residuals, or in other words, your error.

  • Now averaging all the residuals over all the ids gives you something called the mean absolute error, 1/n * sum(|y_truth - y_measured|), which you can very confidently use to report how much error, in a unit of time (seconds, for example), your method produces; see the sketch after this list.
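A minimal sketch of that mean-absolute-error calculation on datetime timestamps. The values are invented except the first, which is the example from the question; pandas is assumed here only for convenient timestamp parsing:

    # Minimal sketch of the mean absolute error on datetime timestamps.
    # Values are illustrative, not taken from the original data.
    import numpy as np
    import pandas as pd

    y_truth = pd.to_datetime([
        "2019-05-23 17:50:34.346000",
        "2019-05-23 17:50:35.351000",
        "2019-05-23 17:50:36.344000",
    ])
    y_measured = pd.to_datetime([
        "2019-05-23 17:50:34.349000",
        "2019-05-23 17:50:35.347000",
        "2019-05-23 17:50:36.350000",
    ])

    # Residuals in seconds: |y_truth - y_measured|
    residuals = np.abs((y_truth - y_measured).total_seconds().to_numpy())

    mae = residuals.mean()  # mean absolute error, in seconds
    print(f"MAE: {mae * 1000:.3f} ms")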
