Is there a faster way to iterate over rows in Python to compute a feature?

Posted on 2025-01-11 06:38:41

I have a pandas DataFrame df that lists the names of players who play a game. The DataFrame has two columns, 'Date' (the date they played a game) and 'Name', and is sorted by date.

Date        Name
1993-03-28  Tom
1993-03-28  Joe
1993-03-29  Tom
1993-03-30  Joe

What I am trying to accomplish is to efficiently calculate, for each row, how many games that player had already played before the game on that day.

For the example DataFrame above, each player's number of previous games would start at 0 and look as follows.

Date        Name  Previous Games
1993-03-28  Tom   0
1993-03-28  Joe   0
1993-03-29  Tom   1
1993-03-30  Joe   1

I have tried the following two approaches, and although both delivered the correct result, they took many days to run on my computer.

Attempt 1:

for i in range(0, len(df) ):
   df['Previous Games'][i] = len( df[ (df['Name'] == df['Name'][i]) & (df['Date'] < df['Date'][i]) ] )

Attempt 2:

df['Previous Games'] = [ len( df[ (df['Name'] == df['Name'][i]) & (df['Date'] < df['Date'][i]) ] ) for i in range(0, len(df) ) ]

Although Attempt 2 was slightly quicker, it was still far too slow, so I need help finding a faster method.

Comments (3)

故事和酒 2025-01-18 06:38:41

Any time you write "for" and "pandas" anywhere close together you are probably doing something wrong.

It seems to me you want the cumulative count:

df["prev_games"] = df.sort_values('Date').groupby('Name').cumcount()
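
As a minimal sketch, here is that one-liner applied to the question's four-row example. The DataFrame is rebuilt by hand from the question, and `kind="stable"` is my own addition (not part of the answer above) so that rows sharing a date keep their original order:

```python
import pandas as pd

# The example data from the question, rebuilt by hand.
df = pd.DataFrame({
    "Date": pd.to_datetime(["1993-03-28", "1993-03-28", "1993-03-29", "1993-03-30"]),
    "Name": ["Tom", "Joe", "Tom", "Joe"],
})

# cumcount() numbers the rows inside each group 0, 1, 2, ... in order of
# appearance, so after sorting by date it counts the earlier rows of the
# same player. The result keeps the original index, so it aligns with df.
df["prev_games"] = (
    df.sort_values("Date", kind="stable").groupby("Name").cumcount()
)

print(df)
```

One caveat: if a player appears twice on the same date, cumcount treats the first of those rows as a previous game for the second, which is not the same as the strict `Date <` comparison used in the question.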

烟酉 2025-01-18 06:38:41

Yes, a quicker way is to avoid explicit for loops. You could group the DataFrame by name and then rank the rows within each group by "Date" using .rank:

>>> df["Previous Games"] = df.groupby("Name")["Date"].rank(method="dense").astype(int) - 1

The -1 makes the count start from 0. Since .rank returns floats, the result is cast to int.
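
The choice of ranking method matters once a player has more than one game on the same date. The sketch below uses hypothetical data (Tom plays twice on 1993-03-28, which is not in the question) to compare `method="dense"` with `method="min"`. The `"min"` variant is my own suggestion, since it appears to match the question's strict `Date <` count:

```python
import pandas as pd

# Hypothetical data: Tom plays twice on 1993-03-28.
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "1993-03-28", "1993-03-28", "1993-03-28", "1993-03-29", "1993-03-30",
    ]),
    "Name": ["Tom", "Tom", "Joe", "Tom", "Joe"],
})

by_player = df.groupby("Name")["Date"]

# "dense" ranks distinct dates, so this counts earlier playing days.
df["Previous Days"] = by_player.rank(method="dense").astype(int) - 1

# "min" gives tied rows the lowest rank of the tie, so this counts rows
# of the same player with a strictly earlier date.
df["Previous Games"] = by_player.rank(method="min").astype(int) - 1

print(df)
```

For Tom's game on 1993-03-29, the dense version reports 1 (one earlier day) while the min version reports 2 (two earlier games). On data where each player plays at most once per date, as in the question's example, the two agree.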

不及他 2025-01-18 06:38:41

This is a pandas question rather than a Python one.

There are several ways to avoid an explicit for loop when working with pandas DataFrames. The most immediate is the following:

# To recreate a dummy dataset:
import numpy as np
import pandas as pd

se = pd.date_range(start='2016-01-01', end='2020-12-31', freq='D')
df = pd.DataFrame({"Date": se, "Name": list(np.random.choice(("joe", "bob", "alice"), len(se)))})

# To add the previous games column
df['Previous Games'] = df.apply(lambda row: ((row["Date"] > df["Date"]) * (row["Name"] == df["Name"])).sum(), axis=1)
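
Note that `apply` with `axis=1` still calls the lambda once per row, and each call scans both full columns, so the work grows roughly quadratically with the number of rows. As a rough sketch, the row-wise result can be checked against a grouped rank on a smaller dummy dataset. The one-year date range, the seeded `default_rng`, and the names `slow` and `fast` are my own choices for illustration:

```python
import numpy as np
import pandas as pd

# A smaller dummy dataset (one year) so the row-wise version finishes quickly.
rng = np.random.default_rng(0)
dates = pd.date_range(start="2016-01-01", end="2016-12-31", freq="D")
df = pd.DataFrame({
    "Date": dates,
    "Name": rng.choice(["joe", "bob", "alice"], len(dates)),
})

# Row-wise version: one full-column comparison per row.
slow = df.apply(
    lambda row: ((row["Date"] > df["Date"]) & (row["Name"] == df["Name"])).sum(),
    axis=1,
)

# Grouped version: rank each player's dates once; "min" rank minus 1 is
# the number of that player's rows with a strictly earlier date.
fast = df.groupby("Name")["Date"].rank(method="min").astype(int) - 1

# Check whether the two approaches agree on every row.
print((slow == fast).all())
```

Timing both on your own data (for example with `%timeit` in IPython) is the most reliable way to see how large the difference is in practice.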