Scrubbing data

Published 2025-02-25 23:43:37

Scrubbing data refers to the preprocessing needed to prepare data for analysis. This may involve removing particular rows or columns, handling missing data, fixing inconsistencies due to data entry errors, transforming dates, generating derived variables, combining data from multiple sources, etc. Unfortunately, no single method can handle all possible data preprocessing needs; however, some familiarity with Python and packages such as those illustrated above will go a long way.
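As a rough sketch of what several of these tasks look like in pandas, the snippet below chains a few of them together. The `visits` and `patients` tables, their column names, and their values are all made up for illustration and are not part of the data set used later in this section.

```python
import pandas as pd

# Hypothetical visit records; every name and value here is invented.
visits = pd.DataFrame({
    'patient_id': [1, 2, 2, 3],
    'visit_date': ['2015-01-03', '2015-01-05', '2015-01-05', '2015-02-05'],
    'weight': [60.0, 72.0, 72.0, None],
    'notes': ['ok', 'ok', 'ok', 'recheck'],
})
# Hypothetical second source with one row per patient.
patients = pd.DataFrame({
    'patient_id': [1, 2, 3],
    'height': [1.56, 1.75, 1.82],
})

clean = (
    visits
    .drop(columns='notes')                # remove a column we do not need
    .drop_duplicates()                    # remove a row entered twice
    .dropna(subset=['weight'])            # drop rows with a missing weight
    .assign(visit_date=lambda d: pd.to_datetime(d['visit_date']))  # parse dates
    .merge(patients, on='patient_id', how='left')  # combine with the second source
)

# Derived variable computed from two existing columns
clean['bmi'] = clean['weight'] / clean['height'] ** 2
print(clean)
```

Which of these steps apply, and in what order, depends entirely on the data at hand; the chain above is only one plausible arrangement.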

For a real-life example of the amount of work required, see the Bureau of Labor Statistics (US Government) example.

Here we will illustrate some simple data cleaning tasks that can be done with pandas.

%%file bad_data.csv
# This is a comment
# This is another comment
name,gender,weight,height
alice,f,60,1.56
bob,m,72,1.75
charles,m,,91
david,m,84,1.82
edgar,m,1.77,93
fanny,f,45,1.45

Overwriting bad_data.csv

# Suppose we wanted to find the average Body Mass Index (BMI)
# from the data set above

import pandas as pd

df = pd.read_csv('bad_data.csv', comment='#')

df.describe()

	weight	height
count	5.000000	6.000000
mean	52.554000	31.763333
std	31.853251	46.663594
min	1.770000	1.450000
25%	45.000000	1.607500
50%	60.000000	1.785000
75%	72.000000	68.705000
max	84.000000	93.000000

Something is strange - the average height is 31 meters!

# Plot the height and weight to see
import matplotlib.pyplot as plt

# drop the missing weight before plotting
plt.boxplot([df.weight.dropna(), df.height]);

df[df.height > 2]

	name	gender	weight	height
2	charles	m	NaN	91
4	edgar	m	1.77	93

# weight and height appear to have been swapped
# so we'll swap them back
idx = df.height > 2
# to_numpy() bypasses column alignment so the values really are swapped
df.loc[idx, ['height', 'weight']] = df.loc[idx, ['weight', 'height']].to_numpy()
df[df.height > 2]

	name	gender	weight	height

df

	name	gender	weight	height
0	alice	f	60	1.56
1	bob	m	72	1.75
2	charles	m	91	NaN
3	david	m	84	1.82
4	edgar	m	93	1.77
5	fanny	f	45	1.45

# we might want to impute the missing height
# perhaps by predicting it from a model of the relationship
# between height, weight and gender
# but for now we'll just ignore rows with missing data

df['BMI'] = df['weight']/(df['height']*df['height'])
df

	name	gender	weight	height	BMI
0	alice	f	60	1.56	24.654832
1	bob	m	72	1.75	23.510204
2	charles	m	91	NaN	NaN
3	david	m	84	1.82	25.359256
4	edgar	m	93	1.77	29.684956
5	fanny	f	45	1.45	21.403092
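The imputation idea mentioned in the comments could look something like the following. This is only a minimal sketch, assuming an ordinary least squares fit of height on weight and a 0/1 gender indicator; the choice of model, the `imputed` name, and the use of `numpy.linalg.lstsq` are assumptions made here, not something the original analysis does. With only five complete rows, the fit is for illustration rather than for serious use.

```python
import numpy as np
import pandas as pd

# Cleaned data from this section (weight and height already swapped back).
df = pd.DataFrame({
    'name': ['alice', 'bob', 'charles', 'david', 'edgar', 'fanny'],
    'gender': ['f', 'm', 'm', 'm', 'm', 'f'],
    'weight': [60.0, 72.0, 91.0, 84.0, 93.0, 45.0],
    'height': [1.56, 1.75, np.nan, 1.82, 1.77, 1.45],
})

# Design matrix: intercept, weight, and a 0/1 indicator for male.
X = np.column_stack([
    np.ones(len(df)),
    df['weight'].to_numpy(),
    (df['gender'] == 'm').to_numpy(dtype=float),
])
known = df['height'].notna().to_numpy()

# Ordinary least squares using only the rows where height is observed.
coef, *_ = np.linalg.lstsq(X[known], df.loc[known, 'height'].to_numpy(), rcond=None)

# Fill only the missing heights with the model's prediction.
imputed = df.copy()
imputed.loc[~known, 'height'] = X[~known] @ coef
print(imputed)
```

The observed heights are left untouched; only the row with a missing height receives a predicted value.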

# And finally, we calculate the mean BMI by gender
df.groupby('gender')['BMI'].mean()

gender
f         23.028962
m         26.184806
Name: BMI, dtype: float64
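Because `mean()` skips missing values, the row with no height drops out of the male average without any warning. One way to make that visible, sketched here with the cleaned values from this section rebuilt inline, is to report the number of non-missing BMI values next to each group mean. The `summary` name is introduced only for this example.

```python
import numpy as np
import pandas as pd

# Cleaned data from this section (weight and height already swapped back).
df = pd.DataFrame({
    'name': ['alice', 'bob', 'charles', 'david', 'edgar', 'fanny'],
    'gender': ['f', 'm', 'm', 'm', 'm', 'f'],
    'weight': [60.0, 72.0, 91.0, 84.0, 93.0, 45.0],
    'height': [1.56, 1.75, np.nan, 1.82, 1.77, 1.45],
})
df['BMI'] = df['weight'] / df['height'] ** 2

# 'count' counts non-missing BMI values per group, so a row with a
# missing height shows up as a smaller count rather than vanishing silently.
summary = df.groupby('gender')['BMI'].agg(['count', 'mean'])
print(summary)
```

Here the male mean is based on three of the four men in the data, which is worth stating alongside the result.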
