How do I use a rank function to keep the nearest values of a column in a DataFrame?

Posted on 2025-01-27 00:48:03

df.head():

                      run_time match_datetime         country                league             home_team                  away_team
0   2021-08-07 00:04:36.326391     2021-08-06          Russia       FNL 2 - Group 2             Yenisey 2          Lokomotiv-Kazanka
1   2021-08-07 00:04:36.326391     2021-08-07          Russia          Youth League              Ural U19  Krylya Sovetov Samara U19
2   2021-08-07 00:04:36.326391     2021-08-08           World         Club Friendly                Alaves                    Al Nasr
3   2021-08-07 00:04:36.326391     2021-08-09           China            Jia League     Chengdu Rongcheng          Shenyang Urban FC
4   2021-08-06 00:04:36.326391     2021-08-06           China          Super League              Wuhan FC       Tianjin Jinmen Tiger
5   2021-08-06 00:04:36.326391     2021-08-07  Czech Republic            U19 League     Sigma Olomouc U19                Karvina U19
6   2021-08-06 00:04:36.326391     2021-08-08          Russia          Youth League  Konoplev Academy U19            Rubin Kazan U19
7   2021-08-06 00:04:36.326391     2021-08-09           World         Club Friendly         Real Sociedad                      Eibar

Desired df:

                      run_time match_datetime         country                league             home_team                  away_team
0   2021-08-07 00:04:36.326391     2021-08-06          Russia       FNL 2 - Group 2             Yenisey 2          Lokomotiv-Kazanka
1   2021-08-07 00:04:36.326391     2021-08-07          Russia          Youth League              Ural U19  Krylya Sovetov Samara U19
4   2021-08-06 00:04:36.326391     2021-08-06           China          Super League              Wuhan FC       Tianjin Jinmen Tiger
5   2021-08-06 00:04:36.326391     2021-08-07  Czech Republic            U19 League     Sigma Olomouc U19                Karvina U19

How do I use the rank function to keep only the 2 nearest match_datetime dates for every run_time value?
That is, the desired dataframe is a filtered dataframe that contains the 2 nearest match_datetime values for each run_time.


千寻… 2025-02-03 00:48:03

Update

Using rank instead of head:

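# Absolute time difference between each run_time and its match_datetime
diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])).abs()
# Keep the rows whose difference ranks 1st or 2nd within their run_time group
out = df.loc[diff.groupby(df['run_time']).rank(method='dense') <= 2]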

Output:

>>> out
                     run_time match_datetime         country         league          home_team                  away_team
1  2021-08-07 00:04:36.326391     2021-08-07          Russia   Youth League           Ural U19  Krylya Sovetov Samara U19
2  2021-08-07 00:04:36.326391     2021-08-08           World  Club Friendly             Alaves                    Al Nasr
4  2021-08-06 00:04:36.326391     2021-08-06           China   Super League           Wuhan FC       Tianjin Jinmen Tiger
5  2021-08-06 00:04:36.326391     2021-08-07  Czech Republic     U19 League  Sigma Olomouc U19                Karvina U19
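
Note that this output keeps row 2 where the desired frame in the question keeps row 0. The following is a minimal, self-contained sketch of why. It rebuilds only the two date columns of the sample (the other columns are left out for brevity) and ranks the distances twice: once against the full run_time timestamp, as above, and once against run_time truncated to midnight. The second variant, including the use of method='first' to break ties, is an assumption about what the asker intended and is not stated in the question.

```python
import pandas as pd

# Trimmed-down reconstruction of the sample frame (date columns only).
df = pd.DataFrame({
    "run_time": ["2021-08-07 00:04:36.326391"] * 4
                + ["2021-08-06 00:04:36.326391"] * 4,
    "match_datetime": ["2021-08-06", "2021-08-07", "2021-08-08", "2021-08-09"] * 2,
})

run = pd.to_datetime(df["run_time"])
match = pd.to_datetime(df["match_datetime"])

# Variant 1: distance to the full timestamp (same logic as the answer).
# Converted to seconds so the ranking runs on plain floats.
diff_ts = (run - match).abs().dt.total_seconds()
by_timestamp = df.loc[diff_ts.groupby(df["run_time"]).rank(method="dense") <= 2]

# Variant 2 (assumption): distance in whole calendar days, with ties
# broken by row order via method="first".
diff_day = (run.dt.normalize() - match).abs().dt.days
by_date = df.loc[diff_day.groupby(df["run_time"]).rank(method="first") <= 2]

print(by_timestamp.index.tolist())  # [1, 2, 4, 5]
print(by_date.index.tolist())       # [0, 1, 4, 5]
```

Row 2 (2021-08-08) is roughly 23 h 55 min away from the 00:04 run time, while row 0 (2021-08-06) is roughly 24 h 05 min away, so the timestamp-based ranking prefers row 2. When only calendar dates are compared, rows 0 and 2 tie at one day, and method='first' keeps the earlier row.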

Alternative

You can use:

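# Sort the absolute differences so the closest matches come first
diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])) \
                              .abs().sort_values()
# Take the first 2 entries per run_time, then restore the original row order
out = df.loc[diff.groupby(df['run_time']).head(2).index].sort_index()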
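
The two approaches can differ when distances tie. As a rough sketch, the sort-then-head idea is wrapped below in a helper called nearest_n (a hypothetical name, not a pandas function) and compared with the dense-rank filter on made-up data in which two matches are exactly one day from the run time.

```python
import pandas as pd

def nearest_n(df, n=2):
    """Keep the n rows per run_time whose match_datetime is closest.

    Hypothetical helper for illustration; not part of pandas.
    """
    diff = (pd.to_datetime(df["run_time"])
            - pd.to_datetime(df["match_datetime"])).abs()
    # Stable sort: rows with equal distance keep their original order.
    ordered = df.assign(_diff=diff).sort_values("_diff", kind="stable")
    return ordered.groupby("run_time").head(n).drop(columns="_diff").sort_index()

# Made-up data with a tie: rows 0 and 2 are both one day from the run time.
tied = pd.DataFrame({
    "run_time": ["2021-08-07 00:00:00"] * 3,
    "match_datetime": ["2021-08-06", "2021-08-07", "2021-08-08"],
})

head_rows = nearest_n(tied, n=2)

diff = (pd.to_datetime(tied["run_time"])
        - pd.to_datetime(tied["match_datetime"])).abs().dt.total_seconds()
dense_rows = tied.loc[diff.groupby(tied["run_time"]).rank(method="dense") <= 2]

print(head_rows.index.tolist())   # [0, 1]
print(dense_rows.index.tolist())  # [0, 1, 2]
```

head(n) always returns at most n rows per group, whereas rank(method='dense') keeps every row that shares one of the n smallest distinct distances, so it can return more than n rows.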


长不大的小祸害 2025-02-03 00:48:03

I suspect that the pandas.DataFrame.rank method on its own can't do this. But pandas.DataFrame.groupby can, if you call head on the grouped result.

Assuming you have the following pandas.DataFrame:

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.array([np.random.randint(0, 3, 10), np.random.rand(10)]).transpose(), columns=['a', 'b'])

And that you want to keep max_num_per_example = 2 representatives of each unique value in the column df['a']:

max_num_per_example = 2
df.groupby(['a']).head(max_num_per_example)

yields

	a	b
0	2.0	0.058084
1	0.0	0.866176
2	2.0	0.601115
4	0.0	0.020584
7	1.0	0.212339

These are the same rows you would get if you took the naive approach:

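max_idx_per_example = 2
idx_to_keep = []
for el_uq in df['a'].unique():
    lg = el_uq == df['a']  # boolean mask selecting the current group
    for i, idx in enumerate(lg[lg].index):
        if i < max_idx_per_example:
            idx_to_keep.append(idx)
        else:
            break
# idx_to_keep holds index labels, so select with .loc; sort to match the groupby order
df_new = df.loc[idx_to_keep].sort_index()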

Which underlines the power of pandas =)
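
One caveat: groupby(...).head(n) keeps the first n rows of each group in their existing order, not the n smallest. A minimal sketch on a small hand-written frame (made-up values, not the random data above) shows that sorting first turns it into an "n smallest per group" filter, and that a grouped rank with method='first' selects the same rows.

```python
import pandas as pd

# Small deterministic frame; the values are made up for illustration.
df = pd.DataFrame({
    "a": [2, 0, 2, 2, 0, 1],
    "b": [0.6, 0.9, 0.1, 0.3, 0.2, 0.5],
})

# First two rows of each group, in row order (b is ignored).
first_two = df.groupby("a").head(2)

# Two smallest b per group: sort by b, take two per group, restore row order.
smallest_two = df.sort_values("b").groupby("a").head(2).sort_index()

# Same selection using a grouped rank on b.
ranked_two = df[df.groupby("a")["b"].rank(method="first") <= 2]

print(first_two.index.tolist())     # [0, 1, 2, 4, 5]
print(smallest_two.index.tolist())  # [1, 2, 3, 4, 5]
print(ranked_two.index.tolist())    # [1, 2, 3, 4, 5]
```

For the question's use case, the value to sort or rank on would be the absolute difference between run_time and match_datetime rather than column b.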
