
Scrubbing data

Published 2025-02-25 23:43:37

Scrubbing data refers to the preprocessing needed to prepare data for analysis. This may involve removing particular rows or columns, handling missing data, fixing inconsistencies due to data entry errors, transforming dates, generating derived variables, combining data from multiple sources, and so on. Unfortunately, no single method can handle every possible preprocessing need; however, some familiarity with Python and packages such as those illustrated above will go a long way.
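As a rough illustration of the kinds of steps listed above, here is a minimal pandas sketch. The data, the column names (`id`, `visit_date`, `score`, `notes`) and the second table are all invented for this example; they are assumptions and do not come from any real data set.

```python
import io
import pandas as pd

# Hypothetical raw data, invented for illustration only.
raw = io.StringIO(
    "id,visit_date,score,notes\n"
    "1,2015-01-03,10,ok\n"
    "2,2015-01-04,,late\n"
    "3,2015-01-05,7,ok\n"
)
visits = pd.read_csv(raw)

visits = visits.drop(columns=["notes"])                       # remove a column
visits["visit_date"] = pd.to_datetime(visits["visit_date"])   # transform dates
visits = visits.dropna(subset=["score"])                      # drop rows with a missing score
visits["score_pct"] = visits["score"] * 10                    # derived variable (score out of 10 as a percentage)

# A second hypothetical source, combined on the shared id column.
people = pd.DataFrame({"id": [1, 2, 3], "name": ["ann", "ben", "cat"]})
combined = visits.merge(people, on="id", how="left")
print(combined)
```

Which of these steps is appropriate, and in what order, depends on the data at hand; the sketch only shows that each one is typically a single pandas call.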

For a real-life example of the amount of work required, see the Bureau of Labor Statistics (US Government) example.

Here we will illustrate some simple data cleaning tasks that can be done with pandas.

%%file bad_data.csv
# This is a comment
# This is another comment
name,gender,weight,height
alice,f,60,1.56
bob,m,72,1.75
charles,m,,91
david,m,84,1.82
edgar,m,1.77,93
fanny,f,45,1.45
Overwriting bad_data.csv
# Suppose we wanted to find the average Body Mass Index (BMI)
# from the data set above

import pandas as pd

df = pd.read_csv('bad_data.csv', comment='#')
df.describe()
          weight     height
count   5.000000   6.000000
mean   52.554000  31.763333
std    31.853251  46.663594
min     1.770000   1.450000
25%    45.000000   1.607500
50%    60.000000   1.785000
75%    72.000000  68.705000
max    84.000000  93.000000

Something is strange - the average height is 31 meters!

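# Plot the height and weight to see
import matplotlib.pyplot as plt

# drop the missing weight so the box for weight can be drawn
plt.boxplot([df.weight.dropna(), df.height]);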

df[df.height > 2]
      name gender  weight  height
2  charles      m     NaN      91
4    edgar      m    1.77      93
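The `height > 2` filter is one hand-picked plausibility check. A slightly more general sketch is to keep a table of plausible ranges per column and flag every row that falls outside any of them. The limits below (in kg and m) are assumptions chosen for illustration, and the names `limits` and `suspect` are hypothetical.

```python
import io
import pandas as pd

# Same content as bad_data.csv, inlined so the sketch is self-contained.
csv_text = """# This is a comment
# This is another comment
name,gender,weight,height
alice,f,60,1.56
bob,m,72,1.75
charles,m,,91
david,m,84,1.82
edgar,m,1.77,93
fanny,f,45,1.45
"""
people = pd.read_csv(io.StringIO(csv_text), comment="#")

# Assumed plausible ranges (kg and m); illustrative values only.
limits = {"weight": (20, 300), "height": (0.5, 2.5)}

suspect = pd.Series(False, index=people.index)
for col, (low, high) in limits.items():
    # between() is False for NaN, so the notna() mask keeps
    # missing values from being flagged as out of range
    out_of_range = people[col].notna() & ~people[col].between(low, high)
    suspect |= out_of_range

print(people[suspect])
```

With these limits the flagged rows are the same two (charles and edgar) that the `height > 2` filter finds.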
# weight and height appear to have been swapped
# so we'll swap them back (the right-hand side is evaluated before either assignment)
idx = df.height > 2
df.loc[idx, 'height'], df.loc[idx, 'weight'] = df.loc[idx, 'weight'], df.loc[idx, 'height']
df[df.height > 2]
      name gender  weight  height
df
      name gender  weight  height
0    alice      f      60    1.56
1      bob      m      72    1.75
2  charles      m      91     NaN
3    david      m      84    1.82
4    edgar      m      93    1.77
5    fanny      f      45    1.45
# we might want to impute the missing height
# perhaps by predicting it from a model of the relationship
# between height, weight and gender
# but for now we'll just ignore rows with missing data

df['BMI'] = df['weight']/(df['height']*df['height'])
df
      name gender  weight  height        BMI
0    alice      f      60    1.56  24.654832
1      bob      m      72    1.75  23.510204
2  charles      m      91     NaN        NaN
3    david      m      84    1.82  25.359256
4    edgar      m      93    1.77  29.684956
5    fanny      f      45    1.45  21.403092
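The comments above mention imputing the missing height from a model of height, weight and gender. The sketch below uses a much simpler stand-in, not that model: it fills a missing height with the median height of the same gender. The column name `height_imputed` is hypothetical, and the table is re-entered by hand so the example runs on its own.

```python
import numpy as np
import pandas as pd

# The corrected table, re-entered by hand so the sketch is self-contained.
people = pd.DataFrame({
    "name": ["alice", "bob", "charles", "david", "edgar", "fanny"],
    "gender": ["f", "m", "m", "m", "m", "f"],
    "weight": [60.0, 72.0, 91.0, 84.0, 93.0, 45.0],
    "height": [1.56, 1.75, np.nan, 1.82, 1.77, 1.45],
})

# Median height per gender (NaN is skipped), broadcast back to every row.
group_median = people.groupby("gender")["height"].transform("median")

# Keep observed heights and fill only the missing ones.
people["height_imputed"] = people["height"].fillna(group_median)
print(people)
```

Keeping the imputed values in a separate column leaves the observed data untouched, so any later analysis can still distinguish measured heights from filled-in ones.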
# And finally, we calculate the mean BMI by gender
df.groupby('gender')['BMI'].mean()
gender
f         23.028962
m         26.184806
Name: BMI, dtype: float64
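Note that `mean()` skips the NaN BMI for charles, so the male average is based on three people, not four. A small sketch that makes the exclusion explicit and reports how many rows contribute to each mean is shown here; the names `complete` and `summary` are hypothetical, and the table is re-entered by hand so the example is self-contained.

```python
import numpy as np
import pandas as pd

# The corrected table, re-entered by hand so the sketch is self-contained.
people = pd.DataFrame({
    "name": ["alice", "bob", "charles", "david", "edgar", "fanny"],
    "gender": ["f", "m", "m", "m", "m", "f"],
    "weight": [60.0, 72.0, 91.0, 84.0, 93.0, 45.0],
    "height": [1.56, 1.75, np.nan, 1.82, 1.77, 1.45],
})
people["BMI"] = people["weight"] / people["height"] ** 2

# Drop rows without a BMI explicitly, then report count and mean per gender.
complete = people.dropna(subset=["BMI"])
summary = complete.groupby("gender")["BMI"].agg(["count", "mean"])
print(summary)
```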
