SciPy minimize by group using a pandas DataFrame

Posted on 2025-01-31 11:34:27

I have a DataFrame (a sample is shown below) and I am trying to minimize a cost function over it.

import numpy as np
import pandas as pd
from scipy.optimize import minimize

GrpId = ['A','A','A','A','A','A','B','B','B','B','B','B','B']
col1 = [69.1,70.5,71.4,72.8,73.2,74.2,208.0,209.2,210.2,211.0,211.2,211.7,212.5]
col2 = [2,3.1,1.1,2.1,6.0,1.1,1.2,1.3,3.1,2.9,5.0,6.1,3.2]
d = {'GrpId':GrpId,'col1':col1,'col2':col2}

df1 = pd.DataFrame(d)

Below are the minimize call and the cost function.

col1_const=[0,0,0,0,60.0,0,0,0]
col2_const=[0,0,0,0,0,100.0,0,0]

def main(type1,type2,type3,df):
    vall0=[type1,type2,type3]
    res=minimize(cost_fun, vall0, args=(df), method = 'SLSQP', tol=0.01)

    [type1,type2,type3]=res.x

    return type1,type2,type3

def cost_fun(v, df):

    df['col1_res'][i] = np.where((df['col1'][i]!=np.nan), ((1/0.095)*(np.sqrt(df['col1'][i])-np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2 ,0)
    df['col2_res'][i] = np.where((df['col2'][i]!=np.nan), ((1/0.12)*(np.sqrt(df['col2'][i])-np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2 ,0)   
    
    res=0.5*np.sqrt(df['col1_res'][i]+df['col2_res'][i])

    return res

I then call this function in a loop as shown below. It works, but it takes a lot of time and memory:

df1['type1']=np.nan
df1['type2']=np.nan
df1['type3']=np.nan
df1['col3']=np.nan
df1['col1_res']=np.nan
df1['col2_res']=np.nan

for i in range(len(df1.GrpId)):
    if i==0:
        df1['type1'][i], df1['type2'][i], df1['type3'][i]= main(0.125, 0.125, 0.125,df1)
    else:
        df1['type1'][i], df1['type2'][i], df1['type3'][i]= main(df1['type1'][i-1], df1['type2'][i-1], df1['type3'][i-1],df1)
    df1['col3'][i]=df1['type1'][i]+df1['type2'][i]

Please note that my real DataFrame has many more rows and columns; I created this sample just for the question.

My questions are:

1. How can I do the same without iterating?
2. The col1_const[4] value will change per group (grouped by GrpId). I have another function that calculates col1_const[4] for each group. How can I pass that per-group value to cost_fun?


Answer from 简美, 2025-02-07 11:34:27

Firstly, I don't think it's necessary to check for != np.nan inside the objective function. That comparison is always True anyway, because NaN compares unequal to everything, including itself; np.isnan is the reliable test. Instead, you could clean up your DataFrame once and replace all np.nan with zero. The objective function is called many times during the optimization routine, so it should be as efficient as possible. Consequently, we remove the call to np.where. Note also that relying on the index variable i being defined in the outer scope is bad practice and makes the code hard to read. I'd recommend something like this:

# A "!= np.nan" mask is always True and filters nothing, so replace NaN with zero instead
col1 = df1['col1'].fillna(0).to_numpy()
col2 = df1['col2'].fillna(0).to_numpy()
col1_const = np.array([0,0,0,0,60.0,0,0,0])
col2_const = np.array([0,0,0,0,0,100.0,0,0])

def cost_fun1(v, *args):
    i, col1, col2, col1_const, col2_const = args
    col1_res = ((1/0.095)*(np.sqrt(col1[i]) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2[i]) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return 0.5*np.sqrt(col1_res + col2_res)
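
The `!= np.nan` check in the question's cost function is worth a quick look on its own. The snippet below is a minimal sketch on a made-up three-element array (the values and variable names are illustrative assumptions, not part of the original data) showing why that comparison never filters anything and what a working alternative looks like.

```python
import numpy as np

# Made-up sample with one missing value
values = np.array([69.1, np.nan, 71.4])

# NaN compares unequal to everything, itself included,
# so this mask is True for every element
always_true = values != np.nan

# np.isnan is the reliable test; invert it to keep the valid entries
valid = ~np.isnan(values)
cleaned = values[valid]

# Alternative in line with the advice above: replace NaN with zero
zero_filled = np.nan_to_num(values, nan=0.0)
```

Dropping entries changes the array length, so if col1 and col2 have NaN in different rows, filling with zero (or dropping whole rows from the DataFrame) keeps the two arrays aligned.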

Next, and more importantly, you are solving many small optimization problems instead of one large-scale optimization problem. Mathematically, because your objective function is guaranteed to be non-negative, you can reformulate your problem in the same vein as this answer. cost_fun2 then returns the sum of cost_fun1 over all indices i. With a bit of reshaping, the function looks nearly the same:

def cost_fun2(vv, *args):
    col1, col2, col1_const, col2_const = args
    v = vv.reshape(3, col1.size)
    col1_res = ((1/0.095)*(np.sqrt(col1) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return np.sum(0.5*np.sqrt(col1_res + col2_res))
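
As a sanity check on the reformulation, the sketch below compares the stacked objective with the sum of the per-row objectives at an arbitrary point. It uses the first three rows of the sample data and passes the row index as an explicit argument; the random test point is an assumption made only for this check.

```python
import numpy as np

col1 = np.array([69.1, 70.5, 71.4])
col2 = np.array([2.0, 3.1, 1.1])
col1_const = np.array([0, 0, 0, 0, 60.0, 0, 0, 0])
col2_const = np.array([0, 0, 0, 0, 0, 100.0, 0, 0])

def cost_fun1(v, i, col1, col2, col1_const, col2_const):
    # Objective for a single row i with its own (type1, type2, type3)
    col1_res = ((1/0.095)*(np.sqrt(col1[i]) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2[i]) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return 0.5*np.sqrt(col1_res + col2_res)

def cost_fun2(vv, col1, col2, col1_const, col2_const):
    # vv holds all type1 values first, then all type2, then all type3
    v = vv.reshape(3, col1.size)
    col1_res = ((1/0.095)*(np.sqrt(col1) - np.sqrt(col1_const[4]*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2) - np.sqrt(col2_const[5]*(0.1*v[0]+v[2])**2)))**2
    return np.sum(0.5*np.sqrt(col1_res + col2_res))

# Arbitrary test point (an assumption for this check only)
rng = np.random.default_rng(0)
vv = rng.uniform(0.1, 1.0, size=3*col1.size)
v = vv.reshape(3, col1.size)

total_rowwise = sum(
    cost_fun1(v[:, i], i, col1, col2, col1_const, col2_const)
    for i in range(col1.size)
)
total_stacked = cost_fun2(vv, col1, col2, col1_const, col2_const)
```

Because each row has its own three variables and every term is non-negative, minimizing the sum minimizes each term separately.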

Then, we simply solve the problem and write the solution values into the dataframe afterwards:

from scipy.optimize import minimize

# initial guess
x0 = np.ones(3*col1.size)

# solve the problem
res = minimize(lambda vv: cost_fun2(vv, col1, col2, col1_const, col2_const), x0=x0, method="trust-constr")

# write to dataframe
type1_vals, type2_vals, type3_vals = np.split(res.x, 3)
df1['type1'] = type1_vals
df1['type2'] = type2_vals
df1['type3'] = type3_vals
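
The second question, a col1_const[4] value that differs per group, can probably be handled with the same stacked objective by turning the constant into a per-row array. The following is a rough sketch under stated assumptions: the dictionary `col1_const_by_group` and its values are placeholders for the output of the asker's own per-group function, the name `cost_by_group` is hypothetical, and the solver method is left at SciPy's default to keep the example short.

```python
import numpy as np
import pandas as pd
from scipy.optimize import minimize

df = pd.DataFrame({
    "GrpId": ["A", "A", "B", "B"],
    "col1": [69.1, 70.5, 208.0, 209.2],
    "col2": [2.0, 3.1, 1.2, 1.3],
})

# Placeholder values: stand-ins for whatever the per-group function returns
col1_const_by_group = {"A": 60.0, "B": 180.0}

# One constant per row, aligned with the rows of df
k1 = df["GrpId"].map(col1_const_by_group).to_numpy()
k2 = 100.0  # col2_const[5] from the question, shared by all groups

col1 = df["col1"].to_numpy()
col2 = df["col2"].to_numpy()

def cost_by_group(vv, col1, col2, k1, k2):
    # Same objective as cost_fun2, but k1 is an array with one entry per row
    v = vv.reshape(3, col1.size)
    col1_res = ((1/0.095)*(np.sqrt(col1) - np.sqrt(k1*(0.1*v[1]+v[2])**2)))**2
    col2_res = ((1/0.12)*(np.sqrt(col2) - np.sqrt(k2*(0.1*v[0]+v[2])**2)))**2
    return np.sum(0.5*np.sqrt(col1_res + col2_res))

x0 = np.ones(3*col1.size)
res = minimize(cost_by_group, x0, args=(col1, col2, k1, k2))

# Write the solution back, one column per variable
df["type1"], df["type2"], df["type3"] = np.split(res.x, 3)
df["col3"] = df["type1"] + df["type2"]
```

Broadcasting applies each row's own constant, so no explicit groupby loop is needed inside the objective.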

If you need col1_res and col2_res in the dataframe, it's straightforward to modify the objective function accordingly.

Last but not least, depending on the size of your DataFrame, it is highly recommended to pass the exact gradient of the objective to scipy.optimize.minimize (via its jac argument) to get good convergence. At the moment, the gradient is approximated by finite differences, which is slow and prone to rounding errors.
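
One way such a gradient might look is sketched below. It assumes the constants are non-negative, so that sqrt(k*x**2) can be written as sqrt(k)*abs(x), and it uses scalar stand-ins `k1` and `k2` for col1_const[4] and col2_const[5]. The names `cost` and `cost_grad` are hypothetical. The gradient is not defined where a residual pair is exactly zero or where 0.1*v[1]+v[2] or 0.1*v[0]+v[2] changes sign, so the small floor on the denominator is only a guard, not a fix for the non-smoothness.

```python
import numpy as np
from scipy.optimize import check_grad, minimize

A, B = 1/0.095, 1/0.12  # scale factors taken from the cost function

def cost(vv, col1, col2, k1, k2):
    v = vv.reshape(3, col1.size)
    p = A*(np.sqrt(col1) - np.sqrt(k1)*np.abs(0.1*v[1] + v[2]))
    q = B*(np.sqrt(col2) - np.sqrt(k2)*np.abs(0.1*v[0] + v[2]))
    return np.sum(0.5*np.hypot(p, q))

def cost_grad(vv, col1, col2, k1, k2):
    v = vv.reshape(3, col1.size)
    u = 0.1*v[1] + v[2]
    w = 0.1*v[0] + v[2]
    p = A*(np.sqrt(col1) - np.sqrt(k1)*np.abs(u))
    q = B*(np.sqrt(col2) - np.sqrt(k2)*np.abs(w))
    h = np.maximum(np.hypot(p, q), 1e-12)  # avoid division by zero
    d_u = 0.5*p/h * (-A*np.sqrt(k1)*np.sign(u))  # derivative w.r.t. u
    d_w = 0.5*q/h * (-B*np.sqrt(k2)*np.sign(w))  # derivative w.r.t. w
    # Chain rule: type1 enters w, type2 enters u, type3 enters both
    return np.concatenate([0.1*d_w, 0.1*d_u, d_u + d_w])

col1 = np.array([69.1, 70.5, 71.4])
col2 = np.array([2.0, 3.1, 1.1])
args = (col1, col2, 60.0, 100.0)

# Compare against finite differences at an arbitrary smooth point
rng = np.random.default_rng(0)
x0 = rng.uniform(0.5, 1.5, size=3*col1.size)
grad_error = check_grad(cost, cost_grad, x0, *args)

res = minimize(cost, x0, args=args, jac=cost_grad)
```

A small `grad_error` suggests the analytic gradient agrees with the finite-difference estimate at that point; it is worth repeating the check at a few other points before relying on it.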
