Compare two columns based on the last N rows in a pandas DataFrame
I want to group by "ts_code" and, for each group, calculate over the last N rows the percentage between the max of one column and the min of another column taken after that max. Specifically:
df
ts_code high low
0 A 20 10
1 A 30 5
2 A 40 20
3 A 50 10
4 A 20 30
5 B 20 10
6 B 30 5
7 B 40 20
8 B 50 10
9 B 20 30
Goal
Below is my expected result
ts_code high low l3_high_low_pct_chg l4_high_low_pct_chg
0 A 20 10 NA NA
1 A 30 5 NA NA
2 A 40 20 0.5 NA
3 A 50 10 0.8 0.8
4 A 20 30 0.8 0.8
5 B 50 10 NA NA
6 B 30 5 NA NA
7 B 40 20 0.9 NA
8 B 10 10 0.75 0.9
9 B 20 30 0.75 0.75
ln_high_low_pct_chg (such as l3_high_low_pct_chg) = 1 - (the min value of the low column after the peak) / (the max value of the high column), computed over the last N rows for each group and each row.
My attempt and the problem
df['l3_highest'] = df.groupby('ts_code')['high'].transform(lambda x: x.rolling(3).max())
df['l3_lowest'] = df.groupby('ts_code')['low'].transform(lambda x: x.rolling(3).min())
df['l3_high_low_pct_chg'] = 1 - df['l3_lowest'] / df['l3_highest']
But this fails: for the second full window, l3_lowest would be 5, not 20, because the rolling min looks at the whole window rather than only the values after the peak. I don't know how to calculate the percentage after the peak.
For the last 4 rows: at index = 8 (low = 10), the window's peak high is 50 and the minimum low after it is 5, so l4_high_low_pct_chg = 1 - 5/50 = 0.9; at index = 9, the peak high is 40 and the minimum low after it is 10, so l4_high_low_pct_chg = 1 - 10/40 = 0.75.
- If the rolling window is 52: for the hy_code 880912 group at index 1252, l52_high_low_pct_chg would be 0.281131, and for the 880301 group at index 1251, l52_high_low_pct_chg would be 0.321471.
Answers (2)
Grouping by 'ts_code' is just a trivial groupby() function. DataFrame.rolling() works on single columns, so it's tricky to apply when you need data from multiple columns. You can use "from numpy_ext import rolling_apply as rolling_apply_ext" as in this example: Pandas rolling apply using multiple columns. However, I just created a function that manually groups the dataframe into n-length sub-dataframes, then applies a function to calculate the value. idxmax() finds the index value of the peak of the high column, then we find the min() of the low values that follow. The rest is pretty straightforward.
If you prefer to use the rolling function, this method gives the same output:
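The rolling-based code was also stripped here. One way to phrase it, assuming rolling.apply with raw=False passes each window as a Series whose index labels point back into the original frame (so the function can look up both columns):

```python
import pandas as pd

df = pd.DataFrame({
    "ts_code": list("AAAAABBBBB"),
    "high": [20, 30, 40, 50, 20, 50, 30, 40, 10, 20],
    "low":  [10,  5, 20, 10, 30, 10,  5, 20, 10, 30],
})

def window_calc(win_high, frame):
    # win_high is the rolling window of 'high' as a Series; its index
    # labels let us pull the matching rows of the original frame
    rows = frame.loc[win_high.index]
    peak = rows["high"].idxmax()
    return 1 - rows.loc[peak:, "low"].min() / rows["high"].max()

df["l3_high_low_pct_chg"] = (
    df.groupby("ts_code")["high"]
      .rolling(3)
      .apply(window_calc, raw=False, args=(df,))
      .droplevel(0)        # drop the ts_code level so indexes align
)
```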
Finally, if you want to do a true rolling window calculation that avoids any index lookup, you can use the numpy_ext (https://pypi.org/project/numpy-ext/)
For large datasets, the speed of these operations becomes an issue. So, to compare the speed of these different methods, I created a timing function:
Next, let's make a large DataFrame, just by copying the existing dataframe 500 times:
Finally, we run the three tests under a timing function:
In my timing runs, the fastest method was numpy_ext, which makes sense because it is optimized for vectorized calculations. The second fastest was the custom function I wrote, which is somewhat efficient because it does some vectorized calculations while also doing some Pandas lookups. The slowest by far was the Pandas rolling function.
For my solution, we'll use .groupby("ts_code") and then .rolling to process groups of a certain size with a custom_function. This custom function will take each window and, instead of applying a function directly to the received values, use those values to query the original dataframe. Then we can calculate the values you expect by finding the row where the "high" peak is, looking at the following rows to find the minimum "low" value, and finally calculating the result with your formula:
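This answer's code was also stripped; a sketch of the described trick, rolling over row positions (a hypothetical helper column "row") and using them to query both columns of the original frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ts_code": list("AAAAABBBBB"),
    "high": [20, 30, 40, 50, 20, 50, 30, 40, 10, 20],
    "low":  [10,  5, 20, 10, 30, 10,  5, 20, 10, 30],
})
df["row"] = np.arange(len(df))          # positions to roll over

def custom_function(pos, frame):
    # pos holds the row positions covered by the current window;
    # use them to query both columns of the original frame
    win = frame.iloc[pos.astype(int)]
    highs = win["high"].to_numpy()
    lows = win["low"].to_numpy()
    peak = highs.argmax()
    return 1 - lows[peak:].min() / highs.max()

df["l3_high_low_pct_chg"] = (
    df.groupby("ts_code")["row"]
      .rolling(3)
      .apply(custom_function, raw=True, args=(df,))
      .droplevel(0)
)
df = df.drop(columns="row")
```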
We can take this idea further and only group once:
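A sketch of the single-grouping variant: one groupby pass produces every lN column at once (window sizes and the helper name are illustrative).

```python
import numpy as np
import pandas as pd

def all_windows(g, sizes=(3, 4)):
    # One pass over the group builds every lN column at once
    highs, lows = g["high"].to_numpy(), g["low"].to_numpy()
    out = {}
    for n in sizes:
        vals = np.full(len(g), np.nan)
        for end in range(n - 1, len(g)):
            h = highs[end - n + 1 : end + 1]
            lo = lows[end - n + 1 : end + 1]
            vals[end] = 1 - lo[h.argmax():].min() / h.max()
        out[f"l{n}_high_low_pct_chg"] = vals
    return pd.DataFrame(out, index=g.index)

df = pd.DataFrame({
    "ts_code": list("AAAAABBBBB"),
    "high": [20, 30, 40, 50, 20, 50, 30, 40, 10, 20],
    "low":  [10,  5, 20, 10, 30, 10,  5, 20, 10, 30],
})
df = df.join(
    df.groupby("ts_code", group_keys=False)[["high", "low"]].apply(all_windows)
)
```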