Compare two columns based on the last N rows in a pandas DataFrame

Posted on 2025-01-12 08:04:51


I want to group by "ts_code" and, for each row, look at the last N rows of its group: find the maximum of the high column, then the minimum of the low column from that maximum onward, and compute the percentage drop between the two. Specifically,

df

  ts_code  high  low
0       A    20   10
1       A    30    5
2       A    40   20
3       A    50   10
4       A    20   30
5       B    50   10
6       B    30    5
7       B    40   20
8       B    10   10
9       B    20   30

Goal

Below is my expected result

  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                   NA                   NA
1       A    30    5                   NA                   NA
2       A    40   20                  0.5                   NA
3       A    50   10                  0.8                  0.8
4       A    20   30                  0.8                  0.8
5       B    50   10                   NA                   NA
6       B    30    5                   NA                   NA
7       B    40   20                  0.9                   NA
8       B    10   10                 0.75                  0.9
9       B    20   30                 0.75                 0.75

lN_high_low_pct_chg (for example l3_high_low_pct_chg) = 1 - (the minimum of the low column from the peak row onward) / (the maximum of the high column), computed for each row over the last N rows of its group.
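To make the definition concrete, here is a minimal sketch of the calculation for a single window, checked against two rows of the expected-result table. The helper name pct_drop_after_peak is hypothetical, and it assumes the peak row itself counts toward the minimum, which is what the expected 0.5 at index 2 implies.

```python
def pct_drop_after_peak(high, low):
    """1 - (min low from the peak row onward) / (max high), for one window."""
    # Position of the first occurrence of the maximum high in the window
    peak_pos = max(range(len(high)), key=high.__getitem__)
    return 1 - min(low[peak_pos:]) / high[peak_pos]

# Group B, window ending at index 7 (rows 5..7): peak 50 at row 5, lowest low after it is 5
print(pct_drop_after_peak([50, 30, 40], [10, 5, 20]))  # 0.9
# Group A, window ending at index 2 (rows 0..2): the peak is the last row, so low = 20
print(pct_drop_after_peak([20, 30, 40], [10, 5, 20]))  # 0.5
```

The remaining question is how to apply this to every trailing window of every group.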

Attempt and problem

df['l3_highest']=df.groupby('ts_code')['high'].transform(lambda x: x.rolling(3).max())
df['l3_lowest']=df.groupby('ts_code')['low'].transform(lambda x: x.rolling(3).min())
df['l3_high_low_pct_chg']=1-df['l3_lowest']/df['l3_highest']

But this fails: for the row at index 2, l3_lowest is 5 rather than 20, because the rolling minimum covers the whole window instead of only the rows from the peak onward. I don't know how to restrict the minimum to the rows after the peak.

For the last 4 rows: at index=8, high=50 and low=5, so l4_high_low_pct_chg=0.9; at index=9, high=40 and low=10, so l4_high_low_pct_chg=0.75.

Another test dataset

If the rolling window is 52, then for the hy_code 880912 group at index 1252, l52_high_low_pct_chg would be 0.281131, and for the 880301 group at index 1251, l52_high_low_pct_chg would be 0.321471.


帅气尐潴 2025-01-19 08:04:51


Grouping by 'ts_code' is just a plain groupby() call. The tricky part is that rolling().apply() hands your function one column at a time, so it is awkward to use when the calculation needs data from several columns. You can use "from numpy_ext import rolling_apply as rolling_apply_ext", as in this example: Pandas rolling apply using multiple columns. However, I first wrote a function that manually slices each group into sub-dataframes of length n and applies the calculation to each one. idxmax() finds the index label of the peak of the high column, then we take the min() of the low values from that row onward. The rest is pretty straightforward.

import numpy as np
import pandas as pd

df = pd.DataFrame([['A', 20, 10],
    ['A', 30, 5],
    ['A', 40, 20],
    ['A', 50, 10],
    ['A', 20, 30],
    ['B', 50, 10],
    ['B', 30, 5],
    ['B', 40, 20],
    ['B', 10, 10],
    ['B', 20, 30]],
    columns=['ts_code', 'high', 'low']
)
    
 
def custom_f(df, n):
    s = pd.Series(np.nan, index=df.index)

    def sub_f(df_):
        high_peak_idx = df_['high'].idxmax()
        min_low_after_peak = df_.loc[high_peak_idx:]['low'].min()
        max_high = df_['high'].max()
        return 1 - min_low_after_peak / max_high

    for i in range(df.shape[0] - n + 1):
        df_ = df.iloc[i:i + n]
        s.iloc[i + n - 1] = sub_f(df_)

    return s


df['l3_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 3).values
df['l4_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 4).values


print(df)

If you prefer to use the rolling function, this method gives the same output:

def rolling_f(rolling_df):
    df_ = df.loc[rolling_df.index]
    high_peak_idx = df_['high'].idxmax()
    min_low_after_peak = df_.loc[high_peak_idx:]["low"].min()
    max_high = df_['high'].max()
    return 1 - min_low_after_peak / max_high

df['l3_high_low_pct_chg'] = df.groupby("ts_code").rolling(3).apply(rolling_f).values[:, 0]
df['l4_high_low_pct_chg'] = df.groupby("ts_code").rolling(4).apply(rolling_f).values[:, 0]

print(df)

Finally, if you want a true rolling-window calculation that avoids any index lookup, you can use numpy_ext (https://pypi.org/project/numpy-ext/):

from numpy_ext import rolling_apply

def np_ext_f(rolling_df, n):
    def rolling_apply_f(high, low):
        return 1 - low[np.argmax(high):].min() / high.max()
    try:
        return pd.Series(rolling_apply(rolling_apply_f, n, rolling_df['high'].values, rolling_df['low'].values), index=rolling_df.index)
    except ValueError:
        return pd.Series(np.nan, index=rolling_df.index)


df['l3_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=3).sort_index(level=1).values
df['l4_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=4).sort_index(level=1).values

print(df)

output:

  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                  NaN                  NaN
1       A    30    5                  NaN                  NaN
2       A    40   20                 0.50                  NaN
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                  NaN                  NaN
6       B    30    5                  NaN                  NaN
7       B    40   20                 0.90                  NaN
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75

For large datasets, the speed of these operations becomes an issue. So, to compare the speed of these different methods, I created a timing function:

import time

def timeit(f):

    def timed(*args, **kw):
        ts = time.time()
        result = f(*args, **kw)
        te = time.time()
        print ('func:%r took: %2.4f sec' % \
          (f.__name__, te-ts))
        return result

    return timed
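As a side note on the timing helper, a variation using time.perf_counter() (a standard-library clock meant for measuring intervals) and functools.wraps (so the wrapped function keeps its name) might look like the sketch below. busy_sum is a hypothetical stand-in workload, not one of the methods being compared.

```python
import functools
import time

def timeit(f):
    @functools.wraps(f)  # keep f.__name__ and the docstring on the wrapper
    def timed(*args, **kwargs):
        start = time.perf_counter()
        result = f(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"func:{f.__name__!r} took: {elapsed:.4f} sec")
        return result
    return timed

@timeit
def busy_sum(n):
    # Stand-in workload, only here to show the decorator in use
    return sum(range(n))

total = busy_sum(1_000_000)
```

The decorator is used the same way as the one above, so the three test functions would not need to change.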

Next, let's make a large DataFrame, just by copying the existing dataframe 500 times:

df = pd.concat([df for x in range(500)], axis=0)
df = df.reset_index()

Finally, we run the three tests under a timing function:

@timeit
def method_1():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").apply(custom_f, 52).values
method_1()

@timeit
def method_2():
    df['l52_high_low_pct_chg'] = df.groupby("ts_code").rolling(52).apply(rolling_f).values[:, 0]
method_2()

@timeit
def method_3():
    df['l52_high_low_pct_chg'] = df.groupby('ts_code').apply(np_ext_f, n=52).sort_index(level=1).values
method_3()

Which gives us this output:

func:'method_1' took: 2.5650 sec
func:'method_2' took: 15.1233 sec
func:'method_3' took: 0.1084 sec

So the fastest method is numpy_ext (about 0.11 s), which makes sense because it is optimized for vectorized calculations. The second fastest is the custom function I wrote (about 2.57 s), which is somewhat efficient because it mixes vectorized calculations with some pandas lookups. The slowest method by far is the pandas rolling function (about 15.12 s).
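If installing numpy_ext is not an option, the same windowed calculation can be sketched with plain NumPy. This assumes NumPy 1.20 or later for numpy.lib.stride_tricks.sliding_window_view; the helper name window_pct_chg is made up for illustration, and it has not been timed against the three methods above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def window_pct_chg(high, low, n):
    """Per row: 1 - min(low from the peak onward) / max(high), over trailing windows of n rows."""
    out = np.full(len(high), np.nan)
    if len(high) < n:
        return out  # group shorter than the window: all NaN
    hw = sliding_window_view(np.asarray(high, dtype=float), n)  # shape (rows - n + 1, n)
    lw = sliding_window_view(np.asarray(low, dtype=float), n)
    peak = hw.argmax(axis=1)  # first position of the max high in each window
    # Mask out lows that come before the peak, then take the row-wise minimum
    from_peak = np.arange(n) >= peak[:, None]
    min_low = np.where(from_peak, lw, np.inf).min(axis=1)
    out[n - 1:] = 1 - min_low / hw.max(axis=1)
    return out

df = pd.DataFrame({
    "ts_code": list("AAAAABBBBB"),
    "high": [20, 30, 40, 50, 20, 50, 30, 40, 10, 20],
    "low": [10, 5, 20, 10, 30, 10, 5, 20, 10, 30],
})
for n in (3, 4):
    col = f"l{n}_high_low_pct_chg"
    df[col] = np.nan
    for _, g in df.groupby("ts_code"):
        # Write each group's result back by index label, so row order does not matter
        df.loc[g.index, col] = window_pct_chg(g["high"].to_numpy(), g["low"].to_numpy(), n)
print(df)
```

Assigning by g.index avoids relying on the groups being contiguous and sorted, which the .values-based assignments above depend on.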


诺曦 2025-01-19 08:04:51


For my solution, we'll use .groupby("ts_code") followed by .rolling to process windows of a given size with a custom_function. Rolling passes the function a window of values; here we roll over the row index, so instead of computing directly on the received values, we use them to look up the corresponding rows in the original dataframe. Then we find the row where the "high" peak is, look at that row and the ones that follow to find the minimum "low" value, and finally calculate the result using your formula:

def custom_function(group, df):
    # Query the original dataframe using the group values
    group = df.loc[group.values]
    # Calculate your formula
    high_peak_row = group["high"].idxmax()
    min_low_after_peak = group.loc[high_peak_row:, "low"].min()
    return 1 - min_low_after_peak / group.loc[high_peak_row, "high"]


# Reset the index so we can roll over the index column and use it to query the original dataframe
df["l3_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(3).apply(custom_function, args=(df,)).values
df["l4_high_low_pct_chg"] = df.reset_index().groupby("ts_code")["index"].rolling(4).apply(custom_function, args=(df,)).values

Output:

  ts_code  high  low  l3_high_low_pct_chg  l4_high_low_pct_chg
0       A    20   10                  NaN                  NaN
1       A    30    5                  NaN                  NaN
2       A    40   20                 0.50                  NaN
3       A    50   10                 0.80                 0.80
4       A    20   30                 0.80                 0.80
5       B    50   10                  NaN                  NaN
6       B    30    5                  NaN                  NaN
7       B    40   20                 0.90                  NaN
8       B    10   10                 0.75                 0.90
9       B    20   30                 0.75                 0.75

We can take this idea further and group only once:

groups = df.reset_index().groupby("ts_code")["index"]
for n in [3, 4]:
    df[f"l{n}_high_low_pct_chg"] = groups.rolling(n).apply(custom_function, args=(df,)).values
