Fitting simulated and experimental data points with Python

Posted 2024-12-05 11:06:55 · 1,561 characters

I have written some code which performs a Monte Carlo simulation and produces curves of signal intensity versus time. The shape of such a curve depends on various parameters, two of which my collaborator wants to determine by the 'real life version' of the experiment I am simulating.

We are ready to compare her experimental data with my simulated curves, but now I am stuck, as I have not yet been able to perform any fit (so far I have replaced the experimental data with simulated noisy data for testing). I have tried using scipy.optimize.leastsq, which exits with code 2 (according to the documentation this means that the fitting was successful), but it mostly simply returns the values (not exactly the same, but close) I input as an initial guess, no matter how close to or far off they were from the true values. If it does report different values, the resulting curve still differs significantly from the true one.

Another observation is that the returned status message (mesg; infodict['nfev'] itself is just the evaluation count) invariably contains

The relative error between two consecutive iterates is at most 0.000000

While using my simulated noisy data I have played around with both parameters' true values being of the same order of magnitude (as I thought that the step size in use might only sensibly affect one of them otherwise), of a very different order of magnitude, and I have varied the step size (parameter epsfcn), but to no avail.

Does anyone have any idea what I might be doing wrong, or what fitting function I could use instead of leastsq? If so: thanks a lot in advance!

EDIT

As suggested by Russ I will now provide some detail on how the simulation is performed: we're looking at small molecules binding to large molecules. This happens with a probability which depends on their mutual affinity (the affinity is one of the values to be extracted from the experimental data). Once binding has occurred, we also simulate how long it takes until the complex falls apart again (the dissociation time constant is the second parameter we're interested in). There are a number of other parameters, but they only become relevant when the expected signal intensity is calculated, so they are not relevant for the actual simulation.

We start off with a given number of small molecules, the state of each of which is simulated for a number of time steps. At each time step we use the affinity value to determine whether this molecule, in case it's unbound, binds to a large molecule. In case it's bound already we use the dissociation time constant and the amount of time for which it has already been bound to determine whether it dissociates in this step.

In both cases the parameter (affinity, dissociation time constant) is used to calculate a probability, which is then compared to a random number (between 0 and 1), and on this comparison it depends if the state of the small molecule (bound/unbound) changes.
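A minimal, purely illustrative sketch of a simulation loop like the one described above (all names and defaults are hypothetical, and the memoryless exponential dissociation rule is an assumption; the real code also uses the time already spent bound):

```python
import math
import random

def simulate_signal(p_on, tau_off, n_molecules=500, n_steps=100, dt=1.0, seed=0):
    # p_on: per-step binding probability derived from the affinity (assumption)
    # tau_off: dissociation time constant
    # Returns the fraction of bound molecules per step (a signal stand-in).
    rng = random.Random(seed)
    bound_for = [None] * n_molecules          # None = unbound, else time bound
    curve = []
    for _ in range(n_steps):
        for i, t in enumerate(bound_for):
            if t is None:
                if rng.random() < p_on:       # binding attempt
                    bound_for[i] = 0.0
            else:
                # per-step dissociation probability for an exponential lifetime
                if rng.random() < 1.0 - math.exp(-dt / tau_off):
                    bound_for[i] = None
                else:
                    bound_for[i] = t + dt
        curve.append(sum(t is not None for t in bound_for) / n_molecules)
    return curve

curve = simulate_signal(0.05, 20.0)   # rises toward the binding equilibrium
```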

EDIT 2

There is no well-defined function which determines the shape of the resulting curve, and, even though the shape is clearly reproducible, there is an element of randomness to each individual data point. Therefore I have now attempted using optimize.fmin instead of leastsq, but it does not converge and simply exits after the maximum number of iterations has been performed.
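One common workaround for a stochastic objective like this is to average several seeded simulation runs per evaluation, so the surface becomes smooth enough for a direct-search method like fmin. A toy sketch (the simulate function below is a hypothetical smooth-plus-noise stand-in, not the actual Monte Carlo code):

```python
import numpy as np
from scipy.optimize import fmin

def simulate(p_on, tau_off, n_steps=50, seed=0):
    # hypothetical stand-in: exponential approach to equilibrium plus noise
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    p_off = 1.0 - np.exp(-1.0 / tau_off)
    eq = p_on / (p_on + p_off)
    return eq * (1.0 - np.exp(-(p_on + p_off) * t)) + rng.normal(0.0, 0.01, n_steps)

def objective(params, data, n_repeats=10):
    # averaging several seeded runs smooths the stochastic objective
    p_on, tau_off = params
    if p_on <= 0.0 or tau_off <= 0.0:
        return 1e12                            # stay in the physical region
    runs = [simulate(p_on, tau_off, seed=k) for k in range(n_repeats)]
    return float(np.sum((np.mean(runs, axis=0) - data) ** 2))

data = simulate(0.05, 20.0, seed=99)           # pretend experimental data
fit = fmin(objective, x0=[0.1, 10.0], args=(data,), disp=False)
```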

EDIT 3

As suggested by Andrea I have uploaded a sample plot. I don't really think providing sample data would help a great deal, it's just one y-value (signal intensity) per x-value (time)...

Comments (4)

寄意 2024-12-12 11:06:55

For fitting your data with an arbitrary function you usually want the Levenberg–Marquardt algorithm, and that is what scipy.optimize.leastsq uses, so you are most likely using the correct function.

Have a look at the tutorial in section 5.4 of this page and see if it helps.

It is also possible your underlying function is difficult to fit (what is the function?), but it sounds like you may be having other issues.

Also - as great as StackOverflow is, you'll likely get much more knowledgeable responses for scipy questions by posting directly to the Scipy-User mailing list with some sample code and more detail.
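For reference, a hedged sketch of calling leastsq with full_output=True on a toy smooth model, which exposes the diagnostics discussed in the question (the model, parameters, and noise level below are invented for illustration):

```python
import numpy as np
from scipy.optimize import leastsq

def model(t, a, tau):
    # toy smooth model standing in for the simulated curve (hypothetical)
    return a * (1.0 - np.exp(-t / tau))

t = np.linspace(0.0, 50.0, 200)
rng = np.random.default_rng(1)
data = model(t, 0.8, 12.0) + rng.normal(0.0, 0.02, t.size)

def residuals(p):
    return model(t, *p) - data

# full_output=True exposes the diagnostics mentioned in the question:
# ier is the exit code, mesg the human-readable stopping reason (that is
# where the "relative error ..." text lives), and infodict['nfev'] the
# number of function evaluations.
p_fit, cov, infodict, mesg, ier = leastsq(residuals, x0=[0.5, 5.0],
                                          full_output=True)
```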

忆梦 2024-12-12 11:06:55

Not exactly an answer but there is also PyMinuit to try.

http://code.google.com/p/pyminuit/

What you want to do is convert your pdf and data points to chi^2 or -ln(likelihood) or a metric of your choosing and use PyMinuit to minimize that metric. It can be configured to be very verbose so you can find out where things went wrong (if it did go wrong).
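PyMinuit's own API is not shown here; purely to illustrate the chi^2 construction this answer describes, here is a sketch with scipy.optimize.minimize as a stand-in minimizer (the model and data are hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

def model(t, a, tau):
    # hypothetical model curve (the real one comes from the simulation)
    return a * (1.0 - np.exp(-t / tau))

def chi2(params, t, data, sigma):
    # the metric to minimize: sum of squared, sigma-scaled residuals
    return float(np.sum(((model(t, *params) - data) / sigma) ** 2))

t = np.linspace(0.0, 50.0, 100)
rng = np.random.default_rng(0)
sigma = 0.05
data = model(t, 1.0, 10.0) + rng.normal(0.0, sigma, t.size)

res = minimize(chi2, x0=[0.5, 5.0], args=(t, data, sigma), method="Nelder-Mead")
```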

木落 2024-12-12 11:06:55

If you don't know the expected functional form of the global curve, but can predict an expected value for the "next" point given the current state of the system, you might consider using a Kalman filter (yeah, "filter" sounds goofy in a fitting context, but the name is historical and can't easily be changed now).

The underlying math can look a bit scary, but the important point is that you don't have to understand it. You typically need to be able to

  1. Define a representation space
  2. Express your data (simulated or experimental) in the representation space.
  3. Define an update procedure that gets a "next" representation from a given one
  4. Extract the desired curve from the series of representations returned by the fitter

There seems to be at least one existing Python package to support this (note that the interface here is different from the ones I'm used to, so I can't offer much advice on using it).
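A minimal one-dimensional sketch, only to illustrate the predict/update cycle behind the four steps above (the random-walk state model and the tuning constants q and r are illustrative assumptions, not values from the question):

```python
import numpy as np

def kalman_1d(measurements, q=1e-5, r=1e-2, x0=0.0, p0=1.0):
    # Minimal scalar Kalman filter with a random-walk state model.
    # q: process noise variance, r: measurement noise variance (assumptions)
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q                  # predict the "next" representation
        k = p / (p + r)            # Kalman gain
        x = x + k * (z - x)        # update with the new measurement
        p = (1.0 - k) * p
        estimates.append(x)        # the filtered curve
    return np.array(estimates)

rng = np.random.default_rng(0)
noisy = 1.0 + rng.normal(0.0, 0.1, 200)   # noisy measurements of a constant
smooth = kalman_1d(noisy)
```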

酒儿 2024-12-12 11:06:55

Since you only have two parameters, you should just do a grid search.

results = {}
for p0 in parameter_space_for_first_parameter:
    for p1 in parameter_space_for_second_parameter:
        results[p0, p1] = compare(p0, p1)

If you can afford the computation, compare should perform multiple runs (with different initialisations) and compute mean and standard deviations. You can try using my package jug to manage your computations (it was designed for exactly this sort of thing).

Finally, plot the results, look at the minima (there might be several). This is a "stupid" method, but it will work in many situations where other methods get stuck.

If this is too much computation, you can do this in two passes: a coarse-grained grid followed by a fine-grained one near the minima of the coarse-grained space.
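The two-pass idea can be sketched as follows (compare here is a hypothetical smooth stand-in for the real run-the-simulation-and-compare step, with its minimum planted at affinity 0.05 and time constant 20.0):

```python
import numpy as np

def compare(p0, p1):
    # hypothetical stand-in for the simulation-vs-data comparison:
    # a smooth bowl with its minimum at (0.05, 20.0)
    return (p0 - 0.05) ** 2 + ((p1 - 20.0) / 100.0) ** 2

def grid_search(p0_values, p1_values):
    results = {(p0, p1): compare(p0, p1)
               for p0 in p0_values for p1 in p1_values}
    return min(results, key=results.get)   # parameter pair with lowest score

# coarse pass over wide ranges...
best0, best1 = grid_search(np.linspace(0.01, 0.2, 20),
                           np.linspace(1.0, 100.0, 20))
# ...then a fine pass around the coarse minimum
fine0, fine1 = grid_search(np.linspace(0.5 * best0, 1.5 * best0, 20),
                           np.linspace(0.5 * best1, 1.5 * best1, 20))
```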
