Python Pandas Segmentation Fault - Summing Columns Together

Posted on 2025-01-17 05:59:01

I am working on a project for daily fantasy sports.

I have a dataframe containing possible lineups in it (6 columns, 1 for each player in a lineup).

As part of my process, I generate a possible fantasy point value for all players.

Next, I want to total the points scored for a lineup in my lineups dataframe by referencing the fantasy points dataframe.

For reference:

  • Lineups Dataframe: columns = F1, F2, F3, F4, F5, F6 where each column is a player's name + '_' + their player id
  • Fantasy Points Dataframe: columns = Player + ID, Fantasy Points

I go column by column for the 6 players to get the 6 fantasy point values:

for col in ['F1', 'F2', 'F3', 'F4', 'F5', 'F6']:
    lineups = lineups.join(
        sim_data[['Name_SlateID', 'Points']].set_index('Name_SlateID'),
        how='left', on=col, rsuffix='x',
    )
    # Rename the joined 'Points' column so each player slot keeps its own column.
    lineups = lineups.rename(columns={'Points': f'{col}_points'})
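An alternative that avoids six separate joins is to build the name-to-points lookup once and map each column through it. A minimal sketch, using hypothetical miniature versions of the two dataframes described above:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the lineups and sim_data dataframes.
lineups = pd.DataFrame({
    'F1': ['alice_1', 'bob_2'],
    'F2': ['carol_3', 'dave_4'],
})
sim_data = pd.DataFrame({
    'Name_SlateID': ['alice_1', 'bob_2', 'carol_3', 'dave_4'],
    'Points': [10.0, 20.0, 30.0, 40.0],
})

# Build the Name_SlateID -> Points lookup once, instead of re-deriving it per join.
points_lookup = sim_data.set_index('Name_SlateID')['Points']

# Map each player column through the shared lookup.
for col in ['F1', 'F2']:
    lineups[f'{col}_points'] = lineups[col].map(points_lookup)

print(lineups[['F1_points', 'F2_points']])
```

`Series.map` with a Series argument does an index-aligned lookup, so it reuses the one `set_index` result rather than rebuilding it on every iteration.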

Then, in what I thought would be the simplest part, I try to sum them up and I get Segmentation Fault: 11

sum_columns = ['F1_points', 'F2_points', 'F3_points', 'F4_points', 'F5_points', 'F6_points']

lineups = reduce_memory_usage(lineups)

lineups[f'sim_{i}_points'] = lineups[sum_columns].sum(axis=1, skipna=True)
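One thing worth trying (an assumption about the cause, not a confirmed diagnosis): float16 is an unusual dtype for pandas reductions, and upcasting the points columns to float32 just for the sum sidesteps any low-level float16 paths while costing little extra memory. A sketch with a hypothetical stand-in dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the lineups dataframe with float16 points columns.
sum_columns = ['F1_points', 'F2_points']
lineups = pd.DataFrame({
    'F1_points': np.array([1.5, 2.5], dtype=np.float16),
    'F2_points': np.array([3.0, 4.0], dtype=np.float16),
})

# Upcast to float32 only for the reduction; the source columns stay float16.
lineups['total_points'] = (
    lineups[sum_columns].astype('float32').sum(axis=1, skipna=True)
)

print(lineups['total_points'].tolist())
```

The temporary float32 copy covers only the six points columns, so the peak memory cost is small relative to the full dataframe.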

reduce_memory_usage comes from this article: https://towardsdatascience.com/6-pandas-mistakes-that-silently-tell-you-are-a-rookie-b566a252e60d

Before running this line I had already reduced the dataframe's memory footprint by 50% by choosing appropriate dtypes. I have tried using pd.eval() instead, and I have tried summing the columns one by one in a for loop, but nothing seems to work.

Any help is greatly appreciated!

Edit:
Specs: OS - macOS Monterey 12.2.1, Python - 3.8.8, pandas - 1.4.1

Here are the details of my lineups dataframe right before the line causing the error:

Data columns (total 27 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   F1                  107056 non-null  object 
 1   F2                  107056 non-null  object 
 2   F3                  107056 non-null  object 
 3   F4                  107056 non-null  object 
 4   F5                  107056 non-null  object 
 5   F6                  107056 non-null  object 
 6   F1_own              107056 non-null  float16
 7   F1_salary           107056 non-null  int16  
 8   F2_own              107056 non-null  float16
 9   F2_salary           107056 non-null  int16  
 10  F3_own              107056 non-null  float16
 11  F3_salary           107056 non-null  int16  
 12  F4_own              107056 non-null  float16
 13  F4_salary           107056 non-null  int16  
 14  F5_own              107056 non-null  float16
 15  F5_salary           107056 non-null  int16  
 16  F6_own              107056 non-null  float16
 17  F6_salary           107056 non-null  int16  
 18  total_salary        107056 non-null  int32  
 19  dupes               107056 non-null  float32
 20  over_600_frequency  107056 non-null  int8   
 21  F1_points           107056 non-null  float16
 22  F2_points           107056 non-null  float16
 23  F3_points           107056 non-null  float16
 24  F4_points           107056 non-null  float16
 25  F5_points           107056 non-null  float16
 26  F6_points           107056 non-null  float16
dtypes: float16(12), float32(1), int16(6), int32(1), int8(1), object(6)
memory usage: 10.3+ MB


Answered by 滥情稳全场 on 2025-01-24 05:59:01:


Segmentation fault 11 means you're using about 8 GB of memory. As a backup plan, there are cloud solutions (e.g. AWS, GCP, Azure) that will give you more than enough memory; Colab is free and might be enough for your needs.

As far as fixing the underlying problem, it might be impossible to use pandas here if your dataset is too big. I would also see if you can store sim_data[['Name_SlateID', 'Points']] in memory so it doesn't get recomputed, and delete dataframes you have already joined, like this. Does any of that help?
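The two suggestions above can be sketched together: compute the lookup once, then drop the full simulation frame and let the garbage collector reclaim it. A minimal sketch with a hypothetical miniature `sim_data`:

```python
import gc

import pandas as pd

# Hypothetical miniature stand-in for the sim_data dataframe.
sim_data = pd.DataFrame({
    'Name_SlateID': ['alice_1', 'bob_2'],
    'Points': [10.0, 20.0],
})

# Compute the lookup once and keep only what is needed for the joins.
points_lookup = sim_data.set_index('Name_SlateID')['Points']

# Drop the full simulation frame once the lookup exists, and prompt the
# garbage collector to reclaim the memory sooner rather than later.
del sim_data
gc.collect()

print(points_lookup.loc['alice_1'])
```

`del` only removes the name binding; `gc.collect()` is the nudge that encourages the interpreter to free the underlying buffers before the next memory-heavy step.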
