What is an efficient way in pandas to group, filter, and count occurrences of a particular value within each group?


Hi, how's it going? I have a giant dataframe and am trying to do a groupby, filter, and then count, within each group, the occurrences of a particular event. The code I have works, but it doesn't scale well at all; it takes forever to run. Can someone help me with a faster way to perform the same computation? Below is what I have so far, reproduced in a dummy example:

import pandas as pd

dates = ['2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-31','2012-03-31','2012-03-31','2012-03-31','2012-03-31']
person = ['dave','mike','mike','dave','mike','dave','dave','dave','mike','mike']
weather = ['rainy','sunny','cloudy','cloudy','rainy','sunny','cloudy','sunny','cloudy','rainy']
events = ['sneeze','cough','sneeze','sneeze','cough','cough','sneeze','cough','sneeze','sneeze']

df = pd.DataFrame({'date':dates,'person':person,'weather':weather,'event':events}) 

def sneeze_by_weather(df):
    num_sneeze = df[df['event']=='sneeze'].shape[0] 
    if num_sneeze==0:
        return 0
    else:
        return num_sneeze

df_transformed = df.groupby(['date','person','weather']).apply(lambda x: sneeze_by_weather(x)).reset_index()

Link to resulting dataframe

Is there any way to perform this computation much faster so that it scales when I have millions of rows?


3 Answers

娇纵 2025-02-09 12:05:37

%%timeit results for your provided dataframe with my code below:

The slowest run took 4.03 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 495 µs per loop

# Encode the event column in place: 1 for 'sneeze', 0 otherwise.
# (df.loc[mask, 'event'] avoids the chained assignment
# df['event'].loc[mask] = ..., which can silently fail on modern pandas.)
df.loc[df['event'] == 'sneeze', 'event'] = 1
df.loc[df['event'] != 1, 'event'] = 0
df
         date person weather event
0  2012-03-30   dave   rainy     1
1  2012-03-30   mike   sunny     0
2  2012-03-30   mike  cloudy     1
3  2012-03-30   dave  cloudy     1
4  2012-03-30   mike   rainy     0
5  2012-03-31   dave   sunny     0
6  2012-03-31   dave  cloudy     1
7  2012-03-31   dave   sunny     0
8  2012-03-31   mike  cloudy     1
9  2012-03-31   mike   rainy     1
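
A hedged aside, not part of the original answer: the two masked assignments can be collapsed into a single vectorized expression, and the per-group count the question actually asks for then falls out of an ordinary groupby sum. A minimal sketch, assuming a fresh copy of the dummy df from the question:

df['event'] = (df['event'] == 'sneeze').astype(int)  # 1 for sneeze, 0 otherwise
df_transformed = df.groupby(['date', 'person', 'weather'], as_index=False)['event'].sum()
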
时光礼记 2025-02-09 12:05:37
import numpy as np   # needed for np.where below
import pandas as pd

dates = ['2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-30','2012-03-31','2012-03-31','2012-03-31','2012-03-31','2012-03-31']
person = ['dave','mike','mike','dave','mike','dave','dave','dave','mike','mike']
weather = ['rainy','sunny','cloudy','cloudy','rainy','sunny','cloudy','sunny','cloudy','rainy']
events = ['sneeze','cough','sneeze','sneeze','cough','cough','sneeze','cough','sneeze','sneeze']

df = pd.DataFrame({'date':dates,'person':person,'weather':weather,'event':events}) 

# Count occurrences of each event within each (date, person, weather) group.
df_transformed = pd.DataFrame(df.groupby(['date','person','weather','event'])['event'].count()).rename(columns = {'event':'count'}).reset_index()
# Keep the count for 'sneeze' rows and zero out everything else.
# (The count != 0 check is redundant: groupby counts are always >= 1.)
df_transformed['0'] = np.where((df_transformed['event'] == 'sneeze') & (df_transformed['count'] != 0),
                              df_transformed['count'],
                              0)

df_transformed = df_transformed.drop(labels = ['event','count'], axis = 1)

df_transformed
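
For reference (derived from the dummy data above; the original answer doesn't show its output), this should display one row per (date, person, weather) group, with the sneeze count in a column labeled with the string '0':

         date person weather  0
0  2012-03-30   dave  cloudy  1
1  2012-03-30   dave   rainy  1
2  2012-03-30   mike  cloudy  1
3  2012-03-30   mike   rainy  0
4  2012-03-30   mike   sunny  0
5  2012-03-31   dave  cloudy  1
6  2012-03-31   dave   sunny  0
7  2012-03-31   mike  cloudy  1
8  2012-03-31   mike   rainy  1

Note that if a group contained both event types, this approach would emit two rows for that group (one holding the sneeze count and one holding 0), unlike the groupby-sum approach in the answer below.
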
还在原地等你 2025-02-09 12:05:37

This should be faster

idx_cols = ['date','person','weather']
idx = pd.MultiIndex.from_frame(df[idx_cols])

df_transformed = (
    df.loc[df.event == 'sneeze', idx_cols]
      .value_counts()
      .reindex(idx, fill_value=0)
      .reset_index()
)

Output:

>>> df_transformed

         date person weather  0
0  2012-03-30   dave   rainy  1
1  2012-03-30   mike   sunny  0
2  2012-03-30   mike  cloudy  1
3  2012-03-30   dave  cloudy  1
4  2012-03-30   mike   rainy  0
5  2012-03-31   dave   sunny  0
6  2012-03-31   dave  cloudy  1
7  2012-03-31   dave   sunny  0
8  2012-03-31   mike  cloudy  1
9  2012-03-31   mike   rainy  1

Another option is to use merge:

idx_cols = ['date','person','weather']

counts = (
    df.loc[df.event == 'sneeze', idx_cols]
      .value_counts()
      .reset_index()
)

df_transformed = ( 
    df[idx_cols].merge(counts, on=idx_cols, how='left')
                .fillna(0)
                .astype({0: int})  # convert the type of the new column (labeled 0) to int. It was float due to NaNs 
)
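
A version caveat worth hedging: on pandas >= 2.0, value_counts().reset_index() labels the count column 'count' rather than 0, so the .astype({0: int}) step here (and the 0 column label in the output above) applies to older pandas; on 2.x it would be .astype({'count': int}).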

I was overcomplicating... actually, this way is much more straightforward

idx_cols = ['date','person','weather']

df_transformed = (
    df.assign(is_sneeze=(df.event == 'sneeze'))
      .groupby(idx_cols, as_index=False)['is_sneeze']
      .sum()
)
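
For completeness (derived from the dummy data; the original answer doesn't show it), this straightforward version should produce:

>>> df_transformed

         date person weather  is_sneeze
0  2012-03-30   dave  cloudy          1
1  2012-03-30   dave   rainy          1
2  2012-03-30   mike  cloudy          1
3  2012-03-30   mike   rainy          0
4  2012-03-30   mike   sunny          0
5  2012-03-31   dave  cloudy          1
6  2012-03-31   dave   sunny          0
7  2012-03-31   mike  cloudy          1
8  2012-03-31   mike   rainy          1

Unlike the reindex and merge versions, the groupby collapses the duplicated (2012-03-31, dave, sunny) rows into one, so the result has 9 rows instead of 10, and the boolean sums come back as integer counts.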