
Split-Apply-Combine

Published 2025-02-25 23:43:39

Many statistical summaries take the same form: split the data along some property, apply a function to each subgroup, and combine the results into a single object. This is known as the ‘split-apply-combine’ pattern, and it is implemented in Pandas via groupby() and a function that can be applied to each subgroup.
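To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with groupby(). The toy DataFrame is hypothetical; only its column names mirror the tips dataset used below.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the tips dataset
df = pd.DataFrame({
    'sex': ['Female', 'Male', 'Male', 'Female', 'Male'],
    'tip': [1.0, 2.0, 4.0, 3.0, 6.0],
})

# Split: one sub-DataFrame per distinct value of 'sex'
pieces = {key: df[df['sex'] == key] for key in df['sex'].unique()}

# Apply: compute the mean tip within each piece
means = {key: piece['tip'].mean() for key, piece in pieces.items()}

# Combine: collect the per-group results into one Series
manual = pd.Series(means, name='tip').sort_index()
manual.index.name = 'sex'

# groupby() performs the same three steps in one call
auto = df.groupby('sex')['tip'].mean()
```

Both manual and auto hold the mean tip per sex; the groupby() version is shorter and avoids scanning the data once per group.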

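# import a DataFrame to play with
import pandas as pd

try:
    # reuse the locally cached copy if one exists
    tips = pd.read_pickle('tips.pic')
except FileNotFoundError:
    # otherwise download the data and cache it for next time
    tips = pd.read_csv('https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/reshape2/tips.csv')
    tips.to_pickle('tips.pic')
tips.head(n=4)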
   Unnamed: 0  total_bill   tip     sex smoker  day    time  size
0           1       16.99  1.01  Female     No  Sun  Dinner     2
1           2       10.34  1.66    Male     No  Sun  Dinner     3
2           3       21.01  3.50    Male     No  Sun  Dinner     3
3           4       23.68  3.31    Male     No  Sun  Dinner     2
# We have an extra set of indices in the first column
# Let's get rid of it

tips = tips.iloc[:, 1:]
tips.head(n=4)
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
# For an example of the split-apply-combine pattern, we want to see counts by sex and smoker status.
# In other words, we split by sex and smoker status to get 2x2 groups,
# then apply the size function to count the number of entries per group
# and finally combine the results into a new multi-index Series

grouped = tips.groupby(['sex', 'smoker'])
grouped.size()
sex     smoker
Female  No        54
        Yes       33
Male    No        97
        Yes       60
dtype: int64
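The split step can be inspected directly, because iterating over a GroupBy object yields one (key, sub-DataFrame) pair per group. The sketch below uses a hypothetical five-row DataFrame rather than the tips data, so the counts differ from the output above.

```python
import pandas as pd

# Hypothetical toy data with the same grouping columns as tips
df = pd.DataFrame({
    'sex': ['Female', 'Female', 'Male', 'Male', 'Male'],
    'smoker': ['No', 'Yes', 'No', 'No', 'Yes'],
    'tip': [1.0, 3.0, 2.0, 4.0, 6.0],
})
grouped = df.groupby(['sex', 'smoker'])

# Iterating yields ((sex, smoker), sub-DataFrame) pairs: the "split" step
manual_sizes = {key: len(piece) for key, piece in grouped}

# size() counts the rows of every piece and combines the counts into a Series
sizes = grouped.size()
```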
# If you need the margins, use the crosstab function

pd.crosstab(tips.sex, tips.smoker, margins=True)
smoker   No  Yes  All
sex
Female   54   33   87
Male     97   60  157
All     151   93  244
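The crosstab result is itself a split-apply-combine computation. As a rough sketch on hypothetical toy data, the same table can be built from groupby().size() by pivoting one index level into columns with unstack() and adding the margins by hand.

```python
import pandas as pd

# Hypothetical toy data with the same grouping columns as tips
df = pd.DataFrame({
    'sex': ['Female', 'Female', 'Male', 'Male', 'Male'],
    'smoker': ['No', 'Yes', 'No', 'No', 'Yes'],
})

# Count rows per (sex, smoker) pair, then move 'smoker' into the columns
counts = df.groupby(['sex', 'smoker']).size().unstack(fill_value=0)

# Add the margins manually: row totals first, then column totals
counts['All'] = counts.sum(axis=1)
counts.loc['All'] = counts.sum(axis=0)

# crosstab produces the same table in one call
table = pd.crosstab(df['sex'], df['smoker'], margins=True)
```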
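# If more than 1 column of results is generated, a DataFrame is returned
# (numeric_only=True skips the non-numeric day and time columns, as recent pandas versions require)

grouped.mean(numeric_only=True)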
               total_bill       tip      size
sex    smoker
Female No       18.105185  2.773519  2.592593
       Yes      17.977879  2.931515  2.242424
Male   No       19.791237  3.113402  2.711340
       Yes      22.284500  3.051167  2.500000
# The returned results can be further manipulated via apply()
# For example, suppose the bill and tips are in USD but we want EUR

import json
from urllib.request import urlopen

# get current conversion rate
converter = json.loads(urlopen('http://rate-exchange.appspot.com/currency?from=USD&to=EUR').read())
print(converter)
grouped[['total_bill', 'tip']].mean().apply(lambda x: x*converter['rate'])
{'to': 'EUR', 'rate': 0.879191, 'from': 'USD'}
               total_bill       tip
sex    smoker
Female No       15.917916  2.438453
       Yes      15.805989  2.577362
Male   No       17.400278  2.737275
       Yes      19.592332  2.682558
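The conversion does not depend on a live lookup. The sketch below assumes a fixed rate, copied from the printed response above, and uses hypothetical toy data; it also suggests that multiplying the aggregated DataFrame by a scalar gives the same result as the apply() call.

```python
import pandas as pd

# Assumed fixed rate, copied from the printed response above (not a current rate)
USD_TO_EUR = 0.879191

# Hypothetical toy data; column names mirror the tips dataset
df = pd.DataFrame({
    'sex': ['Female', 'Female', 'Male', 'Male'],
    'total_bill': [10.0, 20.0, 30.0, 50.0],
    'tip': [1.0, 3.0, 4.0, 6.0],
})

means_usd = df.groupby('sex')[['total_bill', 'tip']].mean()

# Scalar multiplication is applied elementwise to the whole DataFrame
means_eur = means_usd * USD_TO_EUR

# Column-by-column equivalent, in the style of the apply() call above
via_apply = means_usd.apply(lambda col: col * USD_TO_EUR)
```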
# We can also transform the original data for more convenient analysis
# For example, suppose we want standardized units for total bill and tips

zscore = lambda x: (x - x.mean())/x.std()

std_grouped = grouped[['total_bill', 'tip']].transform(zscore)
std_grouped.head(n=4)
   total_bill       tip
0   -0.153049 -1.562813
1   -1.083042 -0.975727
2    0.139661  0.259539
3    0.445623  0.131984
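Unlike an aggregation, transform() returns one value per original row, so the result lines up with the input. A small sketch on hypothetical toy data shows this, and checks that each group ends up with mean 0 and standard deviation 1 after standardization.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the tips dataset
df = pd.DataFrame({
    'sex': ['Female', 'Female', 'Female', 'Male', 'Male', 'Male'],
    'tip': [1.0, 2.0, 3.0, 2.0, 4.0, 6.0],
})

zscore = lambda x: (x - x.mean()) / x.std()

# Each tip is standardized against the mean and std of its own group
z = df.groupby('sex')['tip'].transform(zscore)

# The result keeps the original row index instead of one row per group
same_index = z.index.equals(df.index)

# Within every group the standardized values have mean 0 and std 1
group_means = z.groupby(df['sex']).mean()
group_stds = z.groupby(df['sex']).std()
```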
# Suppose we want to apply a set of functions to only some columns
grouped[['total_bill', 'tip']].agg(['mean', 'min', 'max'])
              total_bill                     tip
                    mean   min    max       mean   min   max
sex    smoker
Female No      18.105185  7.25  35.83   2.773519  1.00   5.2
       Yes     17.977879  3.07  44.30   2.931515  1.00   6.5
Male   No      19.791237  7.51  48.33   3.113402  1.25   9.0
       Yes     22.284500  7.25  50.81   3.051167  1.00  10.0
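Passing a list of functions produces two-level column labels of the form (original column, function name). The sketch below, on hypothetical toy data, shows two ways to pull results back out of such a table.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the tips dataset
df = pd.DataFrame({
    'sex': ['Female', 'Female', 'Male', 'Male'],
    'total_bill': [10.0, 20.0, 30.0, 50.0],
    'tip': [1.0, 3.0, 4.0, 6.0],
})

stats = df.groupby('sex')[['total_bill', 'tip']].agg(['mean', 'min', 'max'])

# A (column, function) tuple selects a single statistic as a Series
tip_mean = stats[('tip', 'mean')]

# A top-level label alone selects all statistics for that column
bill_stats = stats['total_bill']
```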
# We can also apply specific functions to specific columns
df = grouped.agg({'tip': 'sum', 'total_bill': ['min', 'max']})
df
                  tip total_bill
                  sum        min    max
sex    smoker
Female No      149.77       7.25  35.83
       Yes      96.74       3.07  44.30
Male   No      302.00       7.51  48.33
       Yes     183.07       7.25  50.81
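When the two-level column labels are inconvenient, pandas also supports named aggregation, where each keyword argument to agg() names an output column and maps it to a (column, function) pair. A minimal sketch on hypothetical toy data follows; the output names tip_sum, bill_min and bill_max are illustrative choices, not part of the original example.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the tips dataset
df = pd.DataFrame({
    'sex': ['Female', 'Female', 'Male', 'Male'],
    'total_bill': [10.0, 20.0, 30.0, 50.0],
    'tip': [1.0, 3.0, 4.0, 6.0],
})

# Each keyword is an output column name; each value is (input column, function)
summary = df.groupby('sex').agg(
    tip_sum=('tip', 'sum'),
    bill_min=('total_bill', 'min'),
    bill_max=('total_bill', 'max'),
)
```

The result has flat, self-describing column names, which tends to be easier to work with in later steps than a multi-level header.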
