Downsample a time series whenever abs(difference) from the previous sample exceeds a threshold

Posted on 2025-01-30 08:36:31 · 530 words · 1 view · 0 comments

I have a time series of intraday tick-by-tick stock prices that change gradually over time. Whenever there is a small change (e.g. the price increases by $0.01), a new row of data is created. This leads to a very large data series which is slow to plot. I want to downsample so that small changes (e.g. the price goes up/down/up/down/up/down and is unchanged after 50 rows of data) are ignored, which improves plotting speed without sacrificing the qualitative accuracy of the graph. I only want to sample when the price makes a sustained move (e.g. up/up/up/up), so that only obvious changes are displayed.

import pandas as pd
import numpy as np
prices = pd.DataFrame(np.random.randint(0,1000, size=(100, 1))/100+1000, columns=list('A'))

I wish to sample whenever the difference with the previous sample exceeds some threshold. So, I will sample row 0 by default. If row 1, 2, 3 and 4 are too close to row 0, I want to throw them away. Then, if row 5 is sufficiently far away from row 0, I will sample that. Then, row 5 becomes my new anchor point, and I will repeat the same process described immediately above.

Is there a way to do this, ideally without a loop?

Comments (2)

我只土不豪 2025-02-06 08:36:31

You could apply a down-sampling masking function that checks whether the distance has been exceeded, then use it to select the applicable rows.

Here is the down-sampling masking function:

def down_mask(x, max_dist=3):
    global cum_diff
    
    # if NaN return True
    if x!=x:
        return True
    
    cum_diff += x
    if abs(cum_diff) > max_dist:
        cum_diff = 0
        return True
    
    return False
    

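Then apply it and use it as a mask to get the entries that you want. Because the function keeps its running total in the global cum_diff, reset it to 0 before each run: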

cum_diff = 0

df[df['prices'].diff().apply(down_mask, max_dist=5)]

     prices
0   1002.07
1   1007.37
2   1000.09
6   1008.08
10  1001.57
14  1006.74
18  1000.42
19  1006.98
21  1001.30
26  1008.89
28  1003.77
38  1009.04
40  1000.52
44  1007.06
47  1001.21
48  1009.38
49  1001.81
51  1008.64
52  1002.72
55  1008.84
56  1000.86
57  1007.17
67  1001.31
68  1006.33
79  1001.14
98  1009.74
99  1000.53
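
If the global variable is a concern, the same anchor rule can be written as a plain loop that tracks the last kept price directly. This is only a sketch: the name anchor_mask and the small hand-made price list are assumptions for illustration, not part of the code above. Unlike down_mask, it compares each price with the anchor itself instead of summing diffs, and it drops NaN rows after the first instead of keeping them.

```python
import numpy as np
import pandas as pd


def anchor_mask(values, max_dist):
    """Boolean mask: keep a row when it differs from the last kept row
    (the anchor) by more than max_dist. Hypothetical helper name."""
    values = np.asarray(values, dtype=float)
    keep = np.zeros(len(values), dtype=bool)
    if len(values) == 0:
        return keep
    keep[0] = True              # row 0 is always sampled
    anchor = values[0]
    for i in range(1, len(values)):
        if abs(values[i] - anchor) > max_dist:
            keep[i] = True
            anchor = values[i]  # the kept row becomes the new anchor
    return keep


# Small hand-made example (assumed data, chosen so the result is easy to check)
df = pd.DataFrame({"prices": [1000.0, 1001.0, 1002.5, 1006.5, 1004.0, 1000.0, 1000.5]})
sampled = df[anchor_mask(df["prices"].to_numpy(), max_dist=5)]
print(sampled)  # keeps rows 0, 3 and 5
```

It still loops, but over a NumPy array and without shared state, so it can be called repeatedly without resetting anything.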

打小就很酷 2025-02-06 08:36:31

Not exactly what was asked for, but here are two options: one using a threshold alone, and one using a threshold together with a sliding period.

import pandas as pd
import numpy as np

prices = pd.DataFrame(np.random.randint(0,1000, size=(100, 1))/100+1000, columns=list('A'))

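threshold_ = 3

# Option 1: keep a row when it differs from the immediately preceding row by more than the threshold
index = np.abs(prices['A'].values[1:] - prices['A'].values[:-1]) > threshold_
index = np.insert(index, 0, True)  # always keep the first row

print(prices[index], len(prices[index]))

# Option 2: keep a row when it differs from the row `period` steps earlier by more than the threshold
period = 5
hist = len(prices)
index = np.abs(prices['A'].values[period:] - prices['A'].values[:hist-period]) > threshold_
# The first `period` rows have no earlier row to compare against, so they are not selected
index = np.insert(index, 0, np.zeros(period, dtype=bool))

print(prices[index], len(prices[index]))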
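
To see how these two masks differ from the anchor rule in the question, it may help to run them on a slowly drifting series. The sketch below wraps the two options in helper functions; the names consecutive_mask and lagged_mask and the drift data are assumptions made for this illustration.

```python
import numpy as np


def consecutive_mask(values, threshold):
    """Option 1: compare each row with the row just before it."""
    values = np.asarray(values, dtype=float)
    mask = np.abs(np.diff(values)) > threshold
    return np.insert(mask, 0, True)  # always keep the first row


def lagged_mask(values, threshold, period):
    """Option 2: compare each row with the row `period` steps earlier."""
    values = np.asarray(values, dtype=float)
    mask = np.abs(values[period:] - values[:-period]) > threshold
    # The first `period` rows have nothing to compare against
    return np.concatenate([np.zeros(period, dtype=bool), mask])


# Assumed test data: the price rises by exactly 1.0 per row for 10 rows
drift = np.arange(1000.0, 1010.0)

print(np.flatnonzero(consecutive_mask(drift, 3)))  # [0]
print(np.flatnonzero(lagged_mask(drift, 3, 5)))    # [5 6 7 8 9]
```

With a threshold of 3, the consecutive mask keeps only row 0 because no single step exceeds the threshold, and the lagged mask keeps every row from 5 onward. The anchor rule from the question would keep rows 0, 4 and 8, which is why neither option is an exact substitute.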