Out of memory while checking string columns and saving error values in Databricks
I need to do a double-quote check on a dataframe, so I am iterating through all the columns for this check, but it takes a lot of time. I am using Azure Databricks for this.
from pyspark.sql.functions import col, lit, when

for column in columns_list:
    column_name = "`" + column + "`"
    # Flag rows where this column contains a double quote
    df_reject = source_data.withColumn("flag_quotes",
        when(source_data[column_name].rlike('"'), lit("Yes")).otherwise(lit("No")))
    df_quo_rejected_df = df_reject.filter(col("flag_quotes") == "Yes")
    df_quo_rejected_df = df_quo_rejected_df.withColumn("Error", lit(err))
    # Append this column's rejected rows to the output as a single CSV part file
    df_quo_rejected_df.coalesce(1).write.mode("append").option("header", "true") \
        .option("delimiter", delimiter) \
        .format("com.databricks.spark.csv") \
        .save(filelocwrite)
I have around 500 columns and 40 million records. I tried unioning the dataframes on every iteration, but that operation runs out of memory after some time, so instead I save the dataframe and append it to the output on every iteration. Please help me with a way to optimize the running time.
Instead of looping through the columns, you can try checking all of their values in a single pass using exists.
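A minimal sketch of that idea, assuming Spark 3.1+ (where pyspark.sql.functions.exists is available) and that columns_list, err, delimiter, and filelocwrite are defined as in your code: pack all column values into one array and test them together, so the data is scanned once instead of once per column.

from pyspark.sql import functions as F

# Build one array of all column values (cast to string so the element
# types match), then keep rows where any element contains a double quote.
quote_check = F.exists(
    F.array(*[F.col("`" + c + "`").cast("string") for c in columns_list]),
    lambda v: v.contains('"')
)

df_rejected = source_data.filter(quote_check).withColumn("Error", F.lit(err))

# One write at the end instead of one append per column.
df_rejected.write.mode("append") \
    .option("header", "true") \
    .option("delimiter", delimiter) \
    .csv(filelocwrite)

This replaces roughly 500 separate jobs over the 40 million rows with a single scan and a single write; dropping coalesce(1) also lets the write run in parallel instead of funneling everything through one task.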