Vectorized lookup and update of a pandas DF


I have a df and its corresponding dictionary, which I created from a groupby and attached to a key. The df has a lot of columns, but here is the important piece.

df:

    key          change_x  x
0   2012_1_23_1  0         1
...
22  2012_1_23_1  0         1
23  2012_1_23_0  1         0
24  2012_1_23_0  0         0
...
46  2012_1_23_0  0         0
47  2012_1_23_1  1         0
48  2012_1_23_1  0         0
...
70  2012_1_23_1  0         0
71  2012_1_23_1  1         0

dict_df:

     key            x
0    2012_1_23_0    1
1    2012_1_23_1    0

I converted the dict dataframe into a list of records using df.to_dict('records'):

dict:

[{'key': '2012_1_23_0', 'x': 1},
 {'key': '2012_1_23_1', 'x': 0}]

Both the dictionary and the df share this key in 'key'. I've created a loop that reads each row's change_x flag: if it is 1, it increments x in the dictionary, and either way it writes the dictionary's current x back to the row. It takes 2.5 s for 20k rows, and over 3 minutes on my larger df with 400k rows. This would be the df and dictionary after the loop takes place (shown below the loop code), if we assume the same data.

loop code:

def search_key_in_dicts(key, dict):
    # linear scan over the record list for a matching key
    for d in dict:
        if d['key'] == key:
            return d
    return None

def update_value_in_dicts(key, dict, col, value):
    # overwrite one column of the matching record in place
    dict_key = search_key_in_dicts(key, dict)
    dict_key.update({col: value})

def increment_x_value(key, dict):
    update_value_in_dicts(key, dict, 'x', search_key_in_dicts(key, dict).get('x') + 1)
    return search_key_in_dicts(key, dict).get('x')

for i in range(len(data)):
    row = data.iloc[i]
    if row.change_x == 1:  # was a bare `change_x`; the row's flag is what is meant
        increment_x_value(row.key, dict)
    data.at[row.name, 'x'] = search_key_in_dicts(row.key, dict).get('x')
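
For scale, nearly all of the time in this loop is spent in search_key_in_dicts, which rescans the entire record list on every call, so the loop is O(rows x keys). Even keeping the row-by-row structure, flattening the records into a plain {key: x} mapping makes each lookup O(1). A minimal sketch, assuming the keys in the record list are unique:

# flatten the record list into a plain mapping (assumes unique keys)
x_by_key = {d['key']: d['x'] for d in dict}

for i in range(len(data)):
    row = data.iloc[i]
    if row.change_x == 1:
        x_by_key[row.key] += 1                # bump the running counter for this key
    data.at[row.name, 'x'] = x_by_key[row.key]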

df:

    key          change_x  x
0   2012_1_23_1  0         1
...
22  2012_1_23_1  0         1
23  2012_1_23_0  1         1
24  2012_1_23_0  0         1
...
46  2012_1_23_0  0         1
47  2012_1_23_1  1         2
48  2012_1_23_1  0         2
...
70  2012_1_23_1  0         2
71  2012_1_23_1  1         3

dictionary:

     key            x
0    2012_1_23_0    3
1    2012_1_23_1    1

I know the loop works, and I guess I could live with a 3-minute runtime if I do have to run this again. I was trying to make it faster using np.where or df.apply, but neither worked. This is what I tried before:

np.where(df['change_x'] == 1, increment_x_value(df['key'], dict), search_key_in_dicts(df['key'], dict))

but I got this error: ValueError: The truth value of a Series is ambiguous. My best guess is that it's because df['key'] can map to so many values of change_x.
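
For reference, the error reproduces on its own: passing the whole df['key'] Series into search_key_in_dicts makes d['key'] == key an elementwise comparison that returns a boolean Series, and using that Series in an if statement is what raises. A minimal sketch:

import pandas as pd

s = pd.Series(['2012_1_23_0', '2012_1_23_1'])
mask = (s == '2012_1_23_0')   # elementwise -> Series([True, False])
if mask:                      # ValueError: The truth value of a Series is ambiguous.
    pass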

Same for this apply function:

def change_x_apply(key, change_x):
    if change_x == 1:
        increment_x_value(key, dict)
    return search_key_in_dicts(key, dict).get('x')

# note: the original call passed whole columns (key=df['key'], x=df['change_x']),
# which hits the same ambiguous-truth-value error; apply hands one row at a time
df['x'] = df.apply(lambda row: change_x_apply(key=row['key'], change_x=row['change_x']), axis=1)

Kind of at a loss as to what I can do to get this runtime down. Any suggestions?

Answer from 风启觞 (2025-01-29 17:44:41):


You need to rewrite your code completely:

#create default index
df = df.reset_index(drop=True)
#counter column by key
df['g'] = df.groupby('key').cumcount()

#merge df1 (used to generate the dict) on the first row per key; if no match, set 0
df['new'] = (df.merge(df1.assign(g=0)
                         .rename(columns={'x': 'new'}),
                      on=['key', 'g'], how='left')['new']
               .fillna(0)
               .astype(int))

#sum both columns and take the cumulative sum per key group
df['x'] = (df['new'] + df['change_x']).groupby(df['key']).cumsum()

#delete helper columns, commented out for debugging
# df = df.drop(['g','new'], axis=1)
print(df)
           key  change_x  x  g  new
0  2012_1_23_1         0  0  0    0
1  2012_1_23_1         0  0  1    0
2  2012_1_23_0         1  2  0    1
3  2012_1_23_0         0  2  1    0
4  2012_1_23_0         0  2  2    0
5  2012_1_23_1         1  1  2    0
6  2012_1_23_1         0  1  3    0
7  2012_1_23_1         0  1  4    0
8  2012_1_23_1         1  2  5    0

#rebuild the dict frame: the last row per key holds the final x
df1 = df.drop_duplicates('key', keep='last')[['key','x']]
print(df1)
           key  x
4  2012_1_23_0  2
8  2012_1_23_1  2
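
If the records-style dictionary from the question is needed afterwards, it can be rebuilt from the updated df1 with the same to_dict call the question used:

d = df1.to_dict('records')
# [{'key': '2012_1_23_0', 'x': 2}, {'key': '2012_1_23_1', 'x': 2}]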