Sum values per period based on a dict of periods and products
I have a minimum of 12 periods in a list; these are not fixed and there might be more based on the selected product.
Also, I have a dict which has period as the key and a list of products as the value.
{
"20191": ["prod1","prod2","prod3"],
"20192": ["prod2","prod3"],
"20193": ["prod2"]
}
I need to select the data based on period and compute the sum of the amount for each respective period.
Sample data:
period | product | amount |
---|---|---|
20191 | prod1 | 30 |
20192 | prod1 | 30 |
20191 | prod2 | 20 |
20191 | prod3 | 60 |
20193 | prod1 | 30 |
20193 | prod2 | 30 |
Output:
period | amount |
---|---|
20191 | 110 |
20192 | 0 |
20193 | 30 |
Basically, for each period, select only the products listed for it in the dict, and sum their amounts.
My code, which is taking a lot of time:
```python
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

list_series = []
df = spark.read.csv(path, header=True)
periods = [row["period"] for row in df.select("period").distinct().collect()]
for period in periods:
    # period_products is the period -> products dict
    # (renamed from `dict` to avoid shadowing the builtin)
    df1 = (df.filter(F.col("period") == period)
             .filter(F.col("product").isin(period_products[period]))
             .groupBy("period", "product")
             .agg(F.sum("Amount").alias("Amount")))
    list_series.append(df1)
dataframe = reduce(DataFrame.unionAll, list_series)
```
Is there any way I can modify this to improve the performance?
Answers (2)
Solution

Flatten the input dictionary into a list of tuples, then create a new Spark DataFrame called `filters`, then `join` this DataFrame with the original one on the columns `period` and `product`, then `groupBy` `period` and aggregate `amount` using `sum`.

Result
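Spark aside, the flatten-then-join-then-aggregate logic described above can be sketched in plain Python to make the semantics concrete (the `period_products` name is an assumption; the dict contents and sample rows are taken from the question):

```python
# Period -> allowed products, from the question.
period_products = {
    "20191": ["prod1", "prod2", "prod3"],
    "20192": ["prod2", "prod3"],
    "20193": ["prod2"],
}

# Flatten the dict into (period, product) tuples -- the "filters" table.
filters = [(p, prod) for p, prods in period_products.items() for prod in prods]

# Sample data rows: (period, product, amount).
rows = [
    ("20191", "prod1", 30), ("20192", "prod1", 30), ("20191", "prod2", 20),
    ("20191", "prod3", 60), ("20193", "prod1", 30), ("20193", "prod2", 30),
]

# Inner join on (period, product), then sum amount per period.
allowed = set(filters)
totals = {p: 0 for p in period_products}  # keep periods that sum to 0
for period, product, amount in rows:
    if (period, product) in allowed:
        totals[period] += amount

print(totals)  # {'20191': 110, '20192': 0, '20193': 30}
```

In PySpark the same shape is `spark.createDataFrame(filters, ["period", "product"])`, a `join` on both columns, then `groupBy("period").agg(F.sum("amount"))` -- one job instead of one job per period.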
With the following input:
To make your script more performant, you will need to remove the steps that create several DataFrames from one and then union them all back together. Do it in one DataFrame, without splitting.
You can create the filter condition in Python (a filter before the join should add a performance boost), supply it to the `filter` function, and then aggregate.
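A plain-Python sketch of that idea (the `period_products` name and sample rows are assumptions carried over from the question): build one OR-of-ANDs condition from the dict, filter the data once with it, then aggregate. In PySpark the same condition would be built from `(F.col("period") == k) & F.col("product").isin(v)` terms combined with `|`:

```python
period_products = {
    "20191": ["prod1", "prod2", "prod3"],
    "20192": ["prod2", "prod3"],
    "20193": ["prod2"],
}

rows = [
    ("20191", "prod1", 30), ("20192", "prod1", 30), ("20191", "prod2", 20),
    ("20191", "prod3", 60), ("20193", "prod1", 30), ("20193", "prod2", 30),
]

# One predicate per dict entry: (period == k) AND (product IN v).
conds = [
    (lambda r, k=k, v=v: r[0] == k and r[1] in v)
    for k, v in period_products.items()
]
# OR them together into a single filter condition.
keep = lambda r: any(c(r) for c in conds)

# Filter once, then aggregate: sum of amount per period.
totals = {k: 0 for k in period_products}
for period, product, amount in filter(keep, rows):
    totals[period] += amount

print(totals)  # {'20191': 110, '20192': 0, '20193': 30}
```

The whole dataset is scanned once and no per-period DataFrames or unions are created, which is where the original loop was losing time.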