How to implement a rank function for the nearest values of a column in a DataFrame?

Posted on 2025-01-27 00:48:03

df.head():

                      run_time match_datetime         country                league             home_team                  away_team
0   2021-08-07 00:04:36.326391     2021-08-06          Russia       FNL 2 - Group 2             Yenisey 2          Lokomotiv-Kazanka
1   2021-08-07 00:04:36.326391     2021-08-07          Russia          Youth League              Ural U19  Krylya Sovetov Samara U19
2   2021-08-07 00:04:36.326391     2021-08-08           World         Club Friendly                Alaves                    Al Nasr
3   2021-08-07 00:04:36.326391     2021-08-09           China            Jia League     Chengdu Rongcheng          Shenyang Urban FC
4   2021-08-06 00:04:36.326391     2021-08-06           China          Super League              Wuhan FC       Tianjin Jinmen Tiger
5   2021-08-06 00:04:36.326391     2021-08-07  Czech Republic            U19 League     Sigma Olomouc U19                Karvina U19
6   2021-08-06 00:04:36.326391     2021-08-08          Russia          Youth League  Konoplev Academy U19            Rubin Kazan U19
7   2021-08-06 00:04:36.326391     2021-08-09           World         Club Friendly         Real Sociedad                      Eibar

Desired df:

                      run_time match_datetime         country                league             home_team                  away_team
0   2021-08-07 00:04:36.326391     2021-08-06          Russia       FNL 2 - Group 2             Yenisey 2          Lokomotiv-Kazanka
1   2021-08-07 00:04:36.326391     2021-08-07          Russia          Youth League              Ural U19  Krylya Sovetov Samara U19
4   2021-08-06 00:04:36.326391     2021-08-06           China          Super League              Wuhan FC       Tianjin Jinmen Tiger
5   2021-08-06 00:04:36.326391     2021-08-07  Czech Republic            U19 League     Sigma Olomouc U19                Karvina U19

How do I use the rank function to keep only the 2 nearest match_datetime dates for every run_time value?
That is, the desired DataFrame is a filtered version of df that contains the 2 nearest match_datetime values for each run_time.

Comments (2)

千寻… 2025-02-03 00:48:03

Update

Using rank instead of head:

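# Absolute gap between each run_time and its match_datetime
diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])).abs()
# Keep the rows whose gap ranks 1st or 2nd within their run_time group
out = df.loc[diff.groupby(df['run_time']).rank(method='dense') <= 2]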

Output:

>>> out
                     run_time match_datetime         country         league          home_team                  away_team
1  2021-08-07 00:04:36.326391     2021-08-07          Russia   Youth League           Ural U19  Krylya Sovetov Samara U19
2  2021-08-07 00:04:36.326391     2021-08-08           World  Club Friendly             Alaves                    Al Nasr
4  2021-08-06 00:04:36.326391     2021-08-06           China   Super League           Wuhan FC       Tianjin Jinmen Tiger
5  2021-08-06 00:04:36.326391     2021-08-07  Czech Republic     U19 League  Sigma Olomouc U19                Karvina U19
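
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the rank approach on the sample rows from the question. The team and league columns are left out for brevity, the gap is converted to seconds (an extra step that is not in the answer), and the names gap, nearest and earliest are only illustrative. The second selection rests on an assumption: that "nearest" in the question really means the two earliest match dates per run_time.

```python
import pandas as pd

# Sample rows copied from the question (league and team columns omitted).
df = pd.DataFrame({
    "run_time": ["2021-08-07 00:04:36.326391"] * 4
                + ["2021-08-06 00:04:36.326391"] * 4,
    "match_datetime": ["2021-08-06", "2021-08-07", "2021-08-08", "2021-08-09"] * 2,
    "country": ["Russia", "Russia", "World", "China",
                "China", "Czech Republic", "Russia", "World"],
})

run = pd.to_datetime(df["run_time"])
match = pd.to_datetime(df["match_datetime"])

# Absolute gap in seconds, so the rank runs on plain floats.
gap = (run - match).abs().dt.total_seconds()

# Rank the gaps inside each run_time group and keep ranks 1 and 2.
nearest = df.loc[gap.groupby(df["run_time"]).rank(method="dense") <= 2]
print(nearest.index.tolist())   # [1, 2, 4, 5]

# Assumption: "nearest" means the two earliest match dates per run_time.
# The dates are cast to integers before ranking.
earliest = df.loc[
    match.astype("int64").groupby(df["run_time"]).rank(method="dense") <= 2
]
print(earliest.index.tolist())  # [0, 1, 4, 5]
```

With the absolute time difference, 2021-08-08 00:00 is slightly closer to the 2021-08-07 00:04:36 run than 2021-08-06 00:00 is, which is why rows 1 and 2 are kept rather than rows 0 and 1 from the desired frame. Ranking match_datetime itself happens to reproduce the desired frame on this sample, but whether that is what the asker intended is a guess.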

Alternative

You can use:

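# Absolute gap between run_time and match_datetime, smallest first
diff = pd.to_datetime(df['run_time']).sub(pd.to_datetime(df['match_datetime'])) \
                              .abs().sort_values()
# Take the 2 smallest gaps per run_time, then restore the original row order
out = df.loc[diff.groupby(df['run_time']).head(2).index].sort_index()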
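
The rank and head variants can differ when several rows in a group are equally close. The sketch below uses a small made-up frame to show this; the column dist and the labels A and B are hypothetical and not taken from the question. As far as I can tell, rank(method='dense') keeps every row tied at rank 1 or 2, while head(2) and rank(method='first') keep exactly two rows per group.

```python
import pandas as pd

# Hypothetical data: group A has two rows tied for the smallest distance.
df = pd.DataFrame({
    "run_time": ["A", "A", "A", "B", "B"],
    "dist": [1.0, 1.0, 2.0, 5.0, 3.0],
})

# Dense rank: tied rows share a rank, so group A keeps three rows.
by_dense = df.loc[df.groupby("run_time")["dist"].rank(method="dense") <= 2]

# rank(method="first") breaks ties by order of appearance: two rows per group.
by_first = df.loc[df.groupby("run_time")["dist"].rank(method="first") <= 2]

# Sort by distance, then take the first two rows of each group.
by_head = (
    df.sort_values("dist", kind="stable")
      .groupby("run_time")
      .head(2)
      .sort_index()
)

print(by_dense.index.tolist())  # [0, 1, 2, 3, 4]
print(by_first.index.tolist())  # [0, 1, 3, 4]
print(by_head.index.tolist())   # [0, 1, 3, 4]
```

Which behaviour is preferable depends on whether tied matches should all be kept or the result should be capped at two rows per run_time.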

长不大的小祸害 2025-02-03 00:48:03

I suspect that pandas.DataFrame.rank on its own can't do this, but pandas.DataFrame.groupby can if you combine it with head.

Assuming you have the following pandas.DataFrame:

import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame(np.array([np.random.randint(0, 3, 10), np.random.rand(10)]).transpose(), columns=['a', 'b'])

And that you want to keep at most max_num_per_example = 2 rows for each unique value in the column df['a']:

max_num_per_example = 2
df.groupby(['a']).head(max_num_per_example)

yields

     a         b
0  2.0  0.058084
1  0.0  0.866176
2  2.0  0.601115
4  0.0  0.020584
7  1.0  0.212339

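These are the same rows you would get if you took the naive approach:

max_idx_per_example = 2
idx_to_keep = []
for el_uq in df['a'].unique():
    lg = el_uq == df['a']  # boolean mask selecting the rows of this group
    for i, idx in enumerate(lg[lg].index):
        if i < max_idx_per_example:
            idx_to_keep.append(idx)  # idx is an index label, not a position
        else:
            break
# Select by label with .loc; sort_index restores the original row order
df_new = df.loc[idx_to_keep].sort_index()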

Which underlines the power of pandas =)
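
As a side note, groupby(...).head(n) keeps the first n rows of each group in their existing order, so it only picks the "nearest" rows if the frame is sorted by the relevant quantity first. The sketch below reuses the random frame from this answer. The names via_head, via_mask and top_b are illustrative only, and sorting by b stands in for sorting by a distance column.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
# Same draws as in the answer: integers for 'a' first, then floats for 'b'.
df = pd.DataFrame({
    "a": np.random.randint(0, 3, 10).astype(float),
    "b": np.random.rand(10),
})

max_num_per_example = 2

# First two rows of each group, in original row order.
via_head = df.groupby("a").head(max_num_per_example)

# Equivalent mask: cumcount numbers the rows 0, 1, 2, ... within each group.
via_mask = df[df.groupby("a").cumcount() < max_num_per_example]
print(via_head.equals(via_mask))  # True

# To keep the two smallest 'b' values per group instead, sort before head().
top_b = (
    df.sort_values("b")
      .groupby("a")
      .head(max_num_per_example)
      .sort_index()
)
print(top_b)
```

The cumcount form can be handy when the selection has to be combined with other boolean conditions in a single mask.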
