Many statistical summaries are in the form of split along some property, then apply a funciton to each subgroup and finally combine the results into some object. This is known as the ‘split-apply-combine’ pattern and implemnented in Pandas via groupby() and a function that can be applied to each subgroup.
# import a DataFrame to play with
try:
tips = pd.read_pickle('tips.pic')
except:
tips = pd.read_csv('https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/reshape2/tips.csv', )
tips.to_pickle('tips.pic')
tips.head(n=4)
| Unnamed: 0 | total_bill | tip | sex | smoker | day | time | size |
---|
0 | 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
---|
1 | 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
---|
2 | 3 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
---|
3 | 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
---|
# We have an extra set of indices in the first column
# Let's get rid of it
tips = tips.ix[:, 1:]
tips.head(n=4)
| total_bill | tip | sex | smoker | day | time | size |
---|
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
---|
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
---|
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
---|
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
---|
# For an example of the split-apply-combine pattern, we want to see counts by sex and smoker status.
# In other words, we split by sex and smoker status to get 2x2 groups,
# then apply the size function to count the number of entries per group
# and finally combine the results into a new multi-index Series
grouped = tips.groupby(['sex', 'smoker'])
grouped.size()
sex smoker
Female No 54
Yes 33
Male No 97
Yes 60
dtype: int64
# If you need the margins, use the crosstab function
pd.crosstab(tips.sex, tips.smoker, margins=True)
smoker | No | Yes | All |
---|
sex | | | |
---|
Female | 54 | 33 | 87 |
---|
Male | 97 | 60 | 157 |
---|
All | 151 | 93 | 244 |
---|
# If more than 1 column of resutls is generated, a DataFrame is returned
grouped.mean()
| | total_bill | tip | size |
---|
sex | smoker | | | |
---|
Female | No | 18.105185 | 2.773519 | 2.592593 |
---|
Yes | 17.977879 | 2.931515 | 2.242424 |
---|
Male | No | 19.791237 | 3.113402 | 2.711340 |
---|
Yes | 22.284500 | 3.051167 | 2.500000 |
---|
# The returned results can be further manipulated via apply()
# For example, suppose the bill and tips are in USD but we want EUR
import json
import urllib
# get current conversion rate
converter = json.loads(urllib.urlopen('http://rate-exchange.appspot.com/currency?from=USD&to=EUR').read())
print converter
grouped['total_bill', 'tip'].mean().apply(lambda x: x*converter['rate'])
{u'to': u'EUR', u'rate': 0.879191, u'from': u'USD'}
| | total_bill | tip |
---|
sex | smoker | | |
---|
Female | No | 15.917916 | 2.438453 |
---|
Yes | 15.805989 | 2.577362 |
---|
Male | No | 17.400278 | 2.737275 |
---|
Yes | 19.592332 | 2.682558 |
---|
# We can also transform the original data for more convenient analysis
# For example, suppose we want standardized units for total bill and tips
zscore = lambda x: (x - x.mean())/x.std()
std_grouped = grouped['total_bill', 'tip'].transform(zscore)
std_grouped.head(n=4)
| total_bill | tip |
---|
0 | -0.153049 | -1.562813 |
---|
1 | -1.083042 | -0.975727 |
---|
2 | 0.139661 | 0.259539 |
---|
3 | 0.445623 | 0.131984 |
---|
# Suppose we want to apply a set of functions to only some columns
grouped['total_bill', 'tip'].agg(['mean', 'min', 'max'])
| | total_bill | tip |
---|
| | mean | min | max | mean | min | max |
---|
sex | smoker | | | | | | |
---|
Female | No | 18.105185 | 7.25 | 35.83 | 2.773519 | 1.00 | 5.2 |
---|
Yes | 17.977879 | 3.07 | 44.30 | 2.931515 | 1.00 | 6.5 |
---|
Male | No | 19.791237 | 7.51 | 48.33 | 3.113402 | 1.25 | 9.0 |
---|
Yes | 22.284500 | 7.25 | 50.81 | 3.051167 | 1.00 | 10.0 |
---|
# We can also apply specific functions to specific columns
df = grouped.agg({'total_bill': (min, max), 'tip': sum})
df
| | tip | total_bill |
---|
| | sum | min | max |
---|
sex | smoker | | | |
---|
Female | No | 149.77 | 7.25 | 35.83 |
---|
Yes | 96.74 | 3.07 | 44.30 |
---|
Male | No | 302.00 | 7.51 | 48.33 |
---|
Yes | 183.07 | 7.25 | 50.81 |
---|
发布评论