Vectorized lookup and update on a pandas DataFrame
I have a df and its corresponding dictionary, which I created from a groupby and attached to a key. The df has a lot of columns, but here is the important piece.
df:
key change_x x
0 2012_1_23_1 0 1
...
22 2012_1_23_1 0 1
23 2012_1_23_0 1 0
24 2012_1_23_0 0 0
...
46 2012_1_23_0 0 0
47 2012_1_23_1 1 0
48 2012_1_23_1 0 0
...
70 2012_1_23_1 0 0
71 2012_1_23_1 1 0
dict_df:
key x
0 2012_1_23_0 1
1 2012_1_23_1 0
I converted dict_df into a list of dictionaries using dict_df.to_dict('records'):
dict:
[{'key': '2012_1_23_0', 'x': 1},
{'key': '2012_1_23_1', 'x': 0}]
Both the dictionary and the df share this key, in 'key'. I've created a loop that reads change_x for each row, increments the stored x for that row's key when change_x is 1, and then sets the row's x to the current stored x for that key. It takes 2.5 s for 20k rows, and on my larger df with 400k rows it takes over 3 minutes. This would be the df and dictionary after the loop runs, if we assume the same data.
loop code:
def search_key_in_dicts(key, dict):
    # Linear scan: return the first record whose 'key' matches, else None.
    for d in dict:
        if d['key'] == key:
            return d
    return None

def update_value_in_dicts(key, dict, col, value):
    dict_key = search_key_in_dicts(key, dict)
    dict_key.update({col: value})

def increment_x_value(key, dict):
    update_value_in_dicts(key, dict, 'x', search_key_in_dicts(key, dict).get('x') + 1)
    return search_key_in_dicts(key, dict).get('x')

# `data` here refers to the df shown above
for i in range(0, len(data)):
    row = data.iloc[i]
    if row.change_x == 1:
        increment_x_value(row.key, dict)
    data.at[row.name, 'x'] = search_key_in_dicts(row.key, dict).get('x')
df:
key change_x x
0 2012_1_23_1 0 1
...
22 2012_1_23_1 0 1
23 2012_1_23_0 1 1
24 2012_1_23_0 0 1
...
46 2012_1_23_0 0 1
47 2012_1_23_1 1 2
48 2012_1_23_1 0 2
...
70 2012_1_23_1 0 2
71 2012_1_23_1 1 3
dictionary:
key x
0 2012_1_23_0 3
1 2012_1_23_1 1
I know the loop function works, and I guess I could live with a 3-minute runtime if I do have to run this again. I was trying to make it faster using np.where or df.apply, but neither worked. This is what I tried before:
np.where(df['change_x'] == 1, increment_x_value(df['key'], dict), search_key_in_dicts(df['key'], dict))
but I got this error: ValueError: The truth value of a Series is ambiguous.
- my best guess is because df['key'] can map to so many values of change_x.
Same for this apply function:
def change_x_apply(key, change_x):
    if change_x == 1:
        increment_x_value(key, dict)
    return search_key_in_dicts(key, dict).get('x')
df.apply(lambda x: change_x_apply(key = df['key'], x = df['change_x']), axis=1)
Kind of at a loss as to what I can do to get this runtime down. Any suggestions?
Comments (1)
You need to rewrite your code completely:
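One way such a rewrite could look, as a sketch rather than the answerer's exact code: the loop in the question effectively sets each row's x to the starting x for its key plus the number of change_x == 1 rows seen so far for that key, which is a per-group cumulative sum. The toy data below is made up for illustration, and the names `start`, `inc`, `running` and `totals` are hypothetical.

```python
import pandas as pd

# Toy data in the shape described in the question (values are illustrative).
df = pd.DataFrame({
    "key": ["2012_1_23_1", "2012_1_23_1", "2012_1_23_0",
            "2012_1_23_0", "2012_1_23_1", "2012_1_23_1"],
    "change_x": [0, 0, 1, 0, 1, 1],
})
dict_df = pd.DataFrame({"key": ["2012_1_23_0", "2012_1_23_1"], "x": [0, 1]})

# Starting x per key, as a Series indexed by key (replaces the list of dicts).
start = dict_df.set_index("key")["x"]

# 1 where the row triggers an increment, 0 otherwise.
inc = df["change_x"].eq(1).astype(int)

# Running number of increments within each key, in row order (inclusive).
running = inc.groupby(df["key"]).cumsum()

# Row x = starting x for its key + increments seen so far for that key.
df["x"] = df["key"].map(start) + running

# Final per-key x = starting x + total increments for that key.
totals = inc.groupby(df["key"]).sum()
final = start.add(totals, fill_value=0).astype(int)
dict_df = final.rename_axis("key").rename("x").reset_index()

print(df)
print(dict_df)
```

This assumes the rows are already in the order the loop would visit them and that every key in df also appears in dict_df; keys missing from dict_df would produce NaN in `df["x"]`.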