Python Pandas Segmentation Fault - Summing Columns Together
I am working on a project for daily fantasy sports.
I have a dataframe containing possible lineups in it (6 columns, 1 for each player in a lineup).
As part of my process, I generate a possible fantasy point value for all players.
Next, I want to total the points scored for a lineup in my lineups dataframe by referencing the fantasy points dataframe.
For reference:
- Lineups Dataframe: columns = F1, F2, F3, F4, F5, F6 where each column is a player's name + '_' + their player id
- Fantasy Points Dataframe: columns = Player + ID, Fantasy Points
I go column by column for the 6 players to get the 6 fantasy points values:
for col in ['F1', 'F2', 'F3', 'F4', 'F5', 'F6']:
    lineups = lineups.join(sim_data[['Name_SlateID', 'Points']].set_index('Name_SlateID'), how='left', on=col, rsuffix='x')
Then, in what I thought would be the simplest part, I try to sum them up and I get Segmentation Fault: 11
sum_columns = ['F1_points', 'F2_points', 'F3_points', 'F4_points', 'F5_points', 'F6_points']
lineups = reduce_memory_usage(lineups)
lineups[f'sim_{i}_points'] = lineups[sum_columns].sum(axis=1, skipna=True)
reduce_memory_usage comes from this article: https://towardsdatascience.com/6-pandas-mistakes-that-silently-tell-you-are-a-rookie-b566a252e60d
Before running this line, I reduced the dataframe's memory usage by 50% by choosing correct dtypes. I have also tried using pd.eval() instead, and I have tried summing the columns one by one in a for loop, but nothing seems to work.
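One more variation that might be worth trying, offered as a sketch rather than a confirmed fix: upcast the six points columns to float32 before summing. This assumes the float16 dtype plays a role in the crash, which nothing in the post confirms. The toy frame and the column name `sim_0_points` (standing in for `sim_{i}_points`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real lineups frame; the values are made up.
sum_columns = [f"F{n}_points" for n in range(1, 7)]
lineups = pd.DataFrame(
    np.arange(12, dtype=np.float16).reshape(2, 6), columns=sum_columns
)

# Upcast only the columns being summed, then add them row-wise.
# Assumption: float16 is involved in the crash (not confirmed by the post).
points = lineups[sum_columns].astype(np.float32)
lineups["sim_0_points"] = points.sum(axis=1, skipna=True)

print(lineups["sim_0_points"])
```

The upcast copy covers only six columns, so for roughly 107,000 rows it costs a few megabytes at most.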
Any help is greatly appreciated!
Edit:
Specs: OS - MacOS Monterey 12.2.1, python - 3.8.8, pandas - 1.4.1
Here are the details of my lineups dataframe right before the line causing the error:
Data columns (total 27 columns):
 #   Column              Non-Null Count   Dtype
---  ------              --------------   -----
 0   F1                  107056 non-null  object
 1   F2                  107056 non-null  object
 2   F3                  107056 non-null  object
 3   F4                  107056 non-null  object
 4   F5                  107056 non-null  object
 5   F6                  107056 non-null  object
 6   F1_own              107056 non-null  float16
 7   F1_salary           107056 non-null  int16
 8   F2_own              107056 non-null  float16
 9   F2_salary           107056 non-null  int16
 10  F3_own              107056 non-null  float16
 11  F3_salary           107056 non-null  int16
 12  F4_own              107056 non-null  float16
 13  F4_salary           107056 non-null  int16
 14  F5_own              107056 non-null  float16
 15  F5_salary           107056 non-null  int16
 16  F6_own              107056 non-null  float16
 17  F6_salary           107056 non-null  int16
 18  total_salary        107056 non-null  int32
 19  dupes               107056 non-null  float32
 20  over_600_frequency  107056 non-null  int8
 21  F1_points           107056 non-null  float16
 22  F2_points           107056 non-null  float16
 23  F3_points           107056 non-null  float16
 24  F4_points           107056 non-null  float16
 25  F5_points           107056 non-null  float16
 26  F6_points           107056 non-null  float16
dtypes: float16(12), float32(1), int16(6), int32(1), int8(1), object(6)
memory usage: 10.3+ MB
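The `+` in `memory usage: 10.3+ MB` means pandas did not measure the strings held in the six object columns, so the real footprint is somewhat larger than reported. A small sketch of how a deep measurement could be taken, using a made-up two-row frame with hypothetical player names in place of the real data:

```python
import pandas as pd

# Made-up miniature of the lineups frame; the player names are hypothetical.
lineups = pd.DataFrame({
    "F1": pd.Series(["Player A_101", "Player B_102"], dtype=object),
    "F1_points": pd.Series([12.5, 8.0], dtype="float16"),
})

# Shallow: object columns are counted as pointers only.
shallow = lineups.memory_usage(deep=False).sum()
# Deep: also counts the Python string objects the pointers refer to.
deep = lineups.memory_usage(deep=True).sum()

print(f"shallow: {shallow} bytes, deep: {deep} bytes")

# The same report as in the post, but with the strings measured.
lineups.info(memory_usage="deep")
```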
Comments (1)
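Segmentation fault: 11 means the process was terminated by signal 11 (SIGSEGV), an invalid memory access; the 11 is the signal number, not an amount of memory, although running out of memory is one way to trigger it. As a backup plan, cloud providers (e.g. AWS, GCP, Azure) can give you more than enough memory, and Colab is free and might be enough for your needs.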
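To narrow down where the crash happens, Python's standard-library `faulthandler` module can print the Python traceback when the interpreter receives a fatal signal. A minimal sketch, meant to sit at the top of the script before any pandas code runs:

```python
import faulthandler
import signal

# Dump the Python traceback to stderr if the process receives a fatal
# signal such as SIGSEGV, instead of dying with only "Segmentation fault".
faulthandler.enable()

# The number in the error message is the signal number.
print(int(signal.SIGSEGV))  # 11 on macOS and Linux
```

The same handler can be switched on without editing the code by starting the interpreter with `python -X faulthandler`, followed by the script name.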
As for fixing the underlying problem, pandas may not be workable here if your dataset is too big. I would also check whether you can store sim_data[['Name_SlateID', 'Points']] in memory once so it isn't recomputed on every iteration, and delete dataframes once they have been joined. Does any of that help?
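The idea of keeping the lookup in memory could be sketched roughly like this: build the `Name_SlateID` to `Points` mapping once, then reuse it for every player column. Note the assumptions: the sample rows are invented, `Series.map` is swapped in for the `join` used in the question, `total_points` is a hypothetical column name, and `Name_SlateID` values are assumed to be unique.

```python
import pandas as pd

# Invented sample data; only the column names come from the question.
sim_data = pd.DataFrame({
    "Name_SlateID": ["A_1", "B_2", "C_3"],
    "Points": [10.0, 20.0, 30.0],
})
lineups = pd.DataFrame({"F1": ["A_1", "B_2"], "F2": ["C_3", "A_1"]})

# Build the lookup once, outside the loop (assumes unique Name_SlateID).
points_by_player = sim_data.set_index("Name_SlateID")["Points"]

player_columns = ["F1", "F2"]  # F1..F6 in the real data
for col in player_columns:
    # Players missing from the lookup become NaN, as in a left join.
    lineups[f"{col}_points"] = lineups[col].map(points_by_player)

point_columns = [f"{col}_points" for col in player_columns]
lineups["total_points"] = lineups[point_columns].sum(axis=1, skipna=True)

# Drop the intermediate once it is no longer needed so it can be freed.
del points_by_player

print(lineups)
```

Because each column is written under an explicit name such as `F1_points`, no `rsuffix` is needed to separate the six results.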