Scipy minimize with group by using pandas dataframe
I have a data frame (a sample df is below) and I am trying to minimize a cost function on it.
import numpy as np
import pandas as pd
from scipy.optimize import minimize

GrpId = ['A','A','A','A','A','A','B','B','B','B','B','B','B']
col1 = [69.1,70.5,71.4,72.8,73.2,74.2,208.0,209.2,210.2,211.0,211.2,211.7,212.5]
col2 = [2,3.1,1.1,2.1,6.0,1.1,1.2,1.3,3.1,2.9,5.0,6.1,3.2]
d = {'GrpId':GrpId,'col1':col1,'col2':col2}
df1 = pd.DataFrame(d)
Below are the minimize wrapper and the cost function.
col1_const=[0,0,0,0,60.0,0,0,0]
col2_const=[0,0,0,0,0,100.0,0,0]

def main(type1,type2,type3,df):
    vall0=[type1,type2,type3]
    res=minimize(cost_fun, vall0, args=(df,), method = 'SLSQP', tol=0.01)
    [type1,type2,type3]=res.x
    return type1,type2,type3

def cost_fun(v, df):
    df['col1_res'][i] = np.where((df['col1'][i]!=np.nan), ((1/0.095)*(np.sqrt(df['col1'][i])-np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2 ,0)
    df['col2_res'][i] = np.where((df['col2'][i]!=np.nan), ((1/0.12)*(np.sqrt(df['col2'][i])-np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2 ,0)
    res=0.5*np.sqrt(df['col1_res'][i]+df['col2_res'][i])
    return res
Then I iterate this function in a loop as below, which works but takes a lot of time and memory:
df1['type1']=np.nan
df1['type2']=np.nan
df1['type3']=np.nan
df1['col3']=np.nan
df1['col1_res']=np.nan
df1['col2_res']=np.nan

for i in range(len(df1.GrpId)):
    if i==0:
        df1['type1'][i], df1['type2'][i], df1['type3'][i]= main(0.125, 0.125, 0.125,df1)
    else:
        df1['type1'][i], df1['type2'][i], df1['type3'][i]= main(df1['type1'][i-1], df1['type2'][i-1], df1['type3'][i-1],df1)
    df1['col3'][i]=df1['type1'][i]+df1['type2'][i]
Please note that I have a bigger dataframe with more rows and columns; for this question I just created a sample code/case.
My questions are:
- How can I do the same without iteration?
- The col1_const[4] value will change per group (group by GrpId), and I have another function to calculate the col1_const[4] value for each group. How can I pass this value to cost_fun by group in that case?
Comments (1)
Firstly, I don't think it's necessary to check for != np.nan inside the objective function (a != comparison against np.nan is always True anyway, so the check never had any effect). Instead, you could clean up your dataframe and replace all np.nan with zero. The objective function is called many times during the optimization routine, so it should be written to be as efficient and fast as possible. Consequently, we remove the call to np.where. Note also that relying on the index variable i being known in the outer scope is bad practice and makes the code hard to read. I'd recommend passing the values of the current row to the objective function explicitly instead.

Next, and more importantly, you are solving multiple optimization problems instead of solving one large-scale optimization problem. Mathematically, because your objective function is guaranteed to be non-negative, you can reformulate your problem in the same vein as this answer. Then, cost_fun2 basically returns the sum of cost_fun1 over all indices i. Using a bit of reshaping magic, the function looks nearly the same. Then, we simply solve the problem and write the solution values into the dataframe afterwards.
If you need col1_res and col2_res in the dataframe, it's straightforward to modify the objective function accordingly.

Last but not least, depending on the size of your dataframe, it's highly recommended to pass the exact objective gradient to scipy.optimize.minimize in order to obtain good convergence performance. At the moment, the gradient is approximated by finite differences, which is quite slow and prone to rounding errors.
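As a rough illustration of passing an exact gradient, the sketch below returns the summed cost and its gradient from one function and hands both to minimize via jac=True. The function name cost_and_grad, the stacked variable layout and the reduced sample (five rows taken from the question) are assumptions, not part of the original answer. It uses the identity sqrt(c * u**2) = sqrt(c) * |u| for c >= 0. Note that this cost is not differentiable where a row's cost is exactly zero, so the sketch returns a zero gradient contribution there.

```python
import numpy as np
from scipy.optimize import check_grad, minimize


def cost_and_grad(vv, col1, col2, c1=60.0, c2=100.0):
    """Return the summed cost and its gradient w.r.t. the stacked variables."""
    v = vv.reshape(-1, 3)              # row i holds (type1, type2, type3)
    u1 = 0.1 * v[:, 1] + v[:, 2]
    u2 = 0.1 * v[:, 0] + v[:, 2]
    r1 = (np.sqrt(col1) - np.sqrt(c1) * np.abs(u1)) / 0.095
    r2 = (np.sqrt(col2) - np.sqrt(c2) * np.abs(u2)) / 0.12
    s = np.sqrt(r1 ** 2 + r2 ** 2)
    safe = np.where(s > 0, s, 1.0)     # avoids 0/0; r1 = r2 = 0 there, so d1 = d2 = 0
    d1 = 0.5 * r1 / safe * (-np.sqrt(c1) * np.sign(u1) / 0.095)  # d cost / d u1
    d2 = 0.5 * r2 / safe * (-np.sqrt(c2) * np.sign(u2) / 0.12)   # d cost / d u2
    # Chain rule: u1 depends on (type2, type3), u2 depends on (type1, type3).
    grad = np.column_stack([0.1 * d2, 0.1 * d1, d1 + d2])
    return 0.5 * s.sum(), grad.ravel()


# First rows of each group from the question's sample data.
col1 = np.array([69.1, 70.5, 71.4, 208.0, 209.2])
col2 = np.array([2.0, 3.1, 1.1, 1.2, 1.3])
x0 = np.full(3 * len(col1), 0.125)

# Sanity check: distance between the analytic and finite-difference gradients.
err = check_grad(lambda x: cost_and_grad(x, col1, col2)[0],
                 lambda x: cost_and_grad(x, col1, col2)[1], x0)

# jac=True tells minimize that the objective returns (value, gradient).
res = minimize(cost_and_grad, x0, args=(col1, col2), jac=True,
               method="SLSQP", tol=0.01)
```

Comparing against check_grad at a few points before a long run is a cheap way to catch sign or indexing mistakes in a hand-written gradient.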