Performing pandas preprocessing operations on a Spark dataframe
I have a rather large CSV, so I am using AWS EMR to read the data into a Spark dataframe to perform some operations. I have a pandas function that does some simple preprocessing:
import numpy as np

def clean_census_data(df):
    """
    This function cleans the dataframe and drops columns that contain 70% NaN values
    """
    # Replace the string 'None' with np.nan
    df = df.replace('None', np.nan)
    # Replace the weird sentinel number with np.nan
    df = df.replace(-666666666.0, np.nan)
    # Drop columns where 70% or more of the values are NaN
    df = df.loc[:, df.isnull().mean() < .7]
    return df
I want to apply this function to a Spark dataframe, but the functions are not the same. I am not familiar with Spark, and it is not obvious to me how to perform in Spark these operations that are rather simple in pandas. I know I can convert a Spark dataframe into pandas, but that does not seem very efficient.
First answer, so please be kind. This function should work with pyspark dataframes instead of pandas dataframes, and should give you similar results:
Attention: The resulting dataframe contains None instead of np.nan.
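The answer's code block was not captured here, so what follows is only a minimal sketch of what such a PySpark cleaning function could look like, assuming the question's 70% threshold and -666666666.0 sentinel. The function name and structure are illustrative rather than the answerer's original code, and replacing values with None via replace() needs a reasonably recent Spark version:

import pyspark.sql.functions as F

def clean_census_data_spark(df, threshold=0.7):
    """
    Replace the string 'None' and the sentinel number with null,
    then drop columns whose fraction of nulls reaches the threshold.
    """
    # Replace the string 'None' with null (only affects string columns)
    df = df.replace('None', None)
    # Replace the weird sentinel number with null
    df = df.replace(-666666666.0, None)
    # One-row aggregate with the fraction of nulls per column
    null_fractions = df.select(
        [F.mean(F.isnull(c).cast('int')).alias(c) for c in df.columns]
    ).head()
    # Keep only the columns below the threshold
    return df.select([c for c in df.columns if null_fractions[c] < threshold])

This check counts nulls only; NaNs and zeros could be folded in with the condition shown in the second answer.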
Native Spark functions can do such an aggregation for every column.
The following dataframe, df1, contains the percentage of nulls, NaNs and zeros for each column.
With an example:
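The answer's original code and sample output were not captured, so here is a minimal sketch of such an aggregation with a made-up toy dataframe (columns a, b, c) standing in for the answer's example:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Toy data: double columns containing nulls and zeros
df = spark.createDataFrame(
    [(1.0, None, 0.0), (0.0, 3.0, 0.0), (None, None, 5.0)],
    ['a', 'b', 'c'],
)

# One-row dataframe: fraction of values per column that are NaN, null or zero
df1 = df.select([
    F.mean((F.isnan(c) | F.isnull(c) | (F.col(c) == 0)).cast('int')).alias(c)
    for c in df.columns
])
df1.show()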
What remains is just selecting the columns from df1:
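Continuing the sketch above, selecting only the columns whose fraction stays below the question's 70% cut-off:

# Row of per-column fractions computed above
fractions = df1.head()
# Keep the columns with fewer than 70% nulls/NaNs/zeros
df_clean = df.select([c for c in df.columns if fractions[c] < 0.7])
df_clean.show()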
The percentage is calculated based on this condition; change it according to your needs:
F.isnan(c) | F.isnull(c) | (F.col(c) == 0)
This would replace None with np.nan:
df.fillna(np.nan)
This would replace a specified value with np.nan:
df.replace(-666666666, np.nan)