Split-Apply-Combine

发布于 2025-02-25 23:43:39 字数 7533 浏览 0 评论 0 收藏 0

Many statistical summaries are in the form of split along some property, then apply a funciton to each subgroup and finally combine the results into some object. This is known as the ‘split-apply-combine’ pattern and implemnented in Pandas via groupby() and a function that can be applied to each subgroup.

# import a DataFrame to play with
try:
    tips = pd.read_pickle('tips.pic')
except:
    tips = pd.read_csv('https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/reshape2/tips.csv', )
    tips.to_pickle('tips.pic')

tips.head(n=4)

	Unnamed: 0	total_bill	tip	sex	smoker	day	time	size
0	1	16.99	1.01	Female	No	Sun	Dinner	2
1	2	10.34	1.66	Male	No	Sun	Dinner	3
2	3	21.01	3.50	Male	No	Sun	Dinner	3
3	4	23.68	3.31	Male	No	Sun	Dinner	2

# We have an extra set of indices in the first column
# Let's get rid of it

tips = tips.ix[:, 1:]
tips.head(n=4)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2

# For an example of the split-apply-combine pattern, we want to see counts by sex and smoker status.
# In other words, we split by sex and smoker status to get 2x2 groups,
# then apply the size function to count the number of entries per group
# and finally combine the results into a new multi-index Series

grouped = tips.groupby(['sex', 'smoker'])
grouped.size()

sex     smoker
Female  No        54
        Yes       33
Male    No        97
        Yes       60
dtype: int64

# If you need the margins, use the crosstab function

pd.crosstab(tips.sex, tips.smoker, margins=True)

smoker	No	Yes	All
sex
Female	54	33	87
Male	97	60	157
All	151	93	244

# If more than 1 column of resutls is generated, a DataFrame is returned

grouped.mean()

		total_bill	tip	size
sex	smoker
Female	No	18.105185	2.773519	2.592593
Female	Yes	17.977879	2.931515	2.242424
Male	No	19.791237	3.113402	2.711340
Male	Yes	22.284500	3.051167	2.500000

# The returned results can be further manipulated via apply()
# For example, suppose the bill and tips are in USD but we want EUR

import json
import urllib

# get current conversion rate
converter = json.loads(urllib.urlopen('http://rate-exchange.appspot.com/currency?from=USD&to=EUR').read())
print converter
grouped['total_bill', 'tip'].mean().apply(lambda x: x*converter['rate'])

{u'to': u'EUR', u'rate': 0.879191, u'from': u'USD'}

		total_bill	tip
sex	smoker
Female	No	15.917916	2.438453
Female	Yes	15.805989	2.577362
Male	No	17.400278	2.737275
Male	Yes	19.592332	2.682558

# We can also transform the original data for more convenient analysis
# For example, suppose we want standardized units for total bill and tips

zscore = lambda x: (x - x.mean())/x.std()

std_grouped = grouped['total_bill', 'tip'].transform(zscore)
std_grouped.head(n=4)

	total_bill	tip
0	-0.153049	-1.562813
1	-1.083042	-0.975727
2	0.139661	0.259539
3	0.445623	0.131984

# Suppose we want to apply a set of functions to only some columns
grouped['total_bill', 'tip'].agg(['mean', 'min', 'max'])

		total_bill			tip
		mean	min	max	mean	min	max
sex	smoker
Female	No	18.105185	7.25	35.83	2.773519	1.00	5.2
Female	Yes	17.977879	3.07	44.30	2.931515	1.00	6.5
Male	No	19.791237	7.51	48.33	3.113402	1.25	9.0
Male	Yes	22.284500	7.25	50.81	3.051167	1.00	10.0

# We can also apply specific functions to specific columns
df = grouped.agg({'total_bill': (min, max), 'tip': sum})
df

		tip	total_bill
		sum	min	max
sex	smoker
Female	No	149.77	7.25	35.83
Female	Yes	96.74	3.07	44.30
Male	No	302.00	7.51	48.33
Male	Yes	183.07	7.25	50.81

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

列表为空，暂无数据

Split-Apply-Combine

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。