Calculating conditional probabilities from joint pmfs in numpy is too slow. Any ideas? (python-numpy)
I have a joint probability mass function array with shape, for example, (1,2,3,4,5,6), and I want to calculate the probability table conditional on a value for some of the dimensions (that is, export the CPTs) for decision-making purposes.
The code I have come up with so far is the following (the input is the dictionary "vdict" of the form {'variable_1': value_1, 'variable_2': value_2, ... }):
for i in vdict:
    dim = self.invardict.index(i)    # index of the dimension that our variable resides in
    val = self.valdict[i][vdict[i]]  # index of the value we want it to take
    d = d.swapaxes(0, dim)
    d = array([d[val]])              # <-- the line in bold: builds a brand-new array
    d = d.swapaxes(0, dim)
...
So, what I currently do is:
- I translate each variable to the corresponding dimension in the cpt.
- I swap the zeroth axis with the axis I found before.
- I replace the whole 0-axis with just the desired value.
- I put the dimension back on its original axis.
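The steps above can be sketched as a standalone function. This is only an illustration: `invardict` and `valdict` are stand-ins for the class attributes in the question (their real contents are not shown there), and `condition_slow` is a hypothetical name.

```python
import numpy as np

# Hypothetical stand-ins for the attributes used in the question.
invardict = ["a", "b", "c"]                      # list position = axis of each variable
valdict = {"a": {"a0": 0, "a1": 1},              # value label -> index along that axis
           "b": {"b0": 0, "b1": 1, "b2": 2},
           "c": {"c0": 0, "c1": 1, "c2": 2, "c3": 3}}

def condition_slow(d, vdict):
    """Fix each variable in vdict to its value; every axis is kept (with length 1)."""
    for name in vdict:
        dim = invardict.index(name)       # translate the variable to its axis
        val = valdict[name][vdict[name]]  # index of the requested value
        d = d.swapaxes(0, dim)            # bring that axis to the front
        d = np.array([d[val]])            # keep a single entry (this copies the data)
        d = d.swapaxes(0, dim)            # put the axis back where it was
    return d

joint = np.arange(24, dtype=float).reshape(2, 3, 4)
joint /= joint.sum()
result = condition_slow(joint, {"b": "b1", "c": "c2"})
print(result.shape)  # (2, 1, 1)
```

The conditioned axes stay in place with length 1, so the variable-to-axis mapping remains valid after the call.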
Now, the problem is that, in order to do the replacement step, I have to (a) calculate a subarray and (b) put it in a list and convert it back to an array so that I have my new array.
The thing is, the line in bold creates new objects instead of just using references to the old ones. If d is very large (which happens to me) and the methods that use d are called many times (which, again, happens to me), the whole thing becomes very slow.
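A small check makes the difference concrete. The array shape and the index `val` below are arbitrary choices for illustration; the point is only that wrapping a slice in `np.array([...])` allocates new memory, while a length-1 basic slice gives the same shape as a view.

```python
import numpy as np

d = np.zeros((4, 5, 6))
val = 2

copied = np.array([d[val]])  # what the loop does: allocates and fills a new array
view = d[val:val + 1]        # basic slice: same shape, but no data is copied

print(copied.shape == view.shape)   # True
print(np.shares_memory(copied, d))  # False
print(np.shares_memory(view, d))    # True
```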
So, does anyone have an idea for something that could substitute for this little piece of code and run faster? Maybe something that would let me calculate the conditionals in place.
Note: I have to maintain the original axis order (or at least be sure how to update the variable-to-dimension dictionaries when an axis is removed). I'd rather not resort to custom dtypes.
Comments (1)
OK, I found the answer myself after playing around a little with numpy's in-place array manipulations.
Changed the last 3 lines in the loop to:
where conditionalize is defined as:
That reduced my program's execution time from 15 minutes to 6 seconds. A huge gain.
I hope this helps someone who comes across the same problem.
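For readers who want something concrete to try, here is a minimal sketch of one copy-free way to write such a helper. It is not necessarily the implementation used in this answer: the signature `conditionalize(arr, dim, val)` is an assumption, and `condition_on`, `fixed`, and `normalize` are hypothetical names. The final normalization is also an addition, included because a slice of a joint pmf has to be divided by its sum to become a conditional distribution.

```python
import numpy as np

def conditionalize(arr, dim, val):
    """Return a view of arr with axis `dim` restricted to index `val` (length 1).

    Assumes val is a non-negative index along that axis.
    """
    index = [slice(None)] * arr.ndim
    index[dim] = slice(val, val + 1)  # a length-1 slice keeps the axis in place
    return arr[tuple(index)]          # basic slicing only, so no data is copied

def condition_on(joint, fixed, normalize=True):
    """Condition on every {axis: index} pair in `fixed`, keeping the axis order."""
    d = joint
    for dim, val in fixed.items():
        d = conditionalize(d, dim, val)
    if normalize:
        total = d.sum()
        if total > 0:
            d = d / total  # allocates, but only for the (much smaller) result
    return d

joint = np.random.default_rng(0).random((2, 3, 4))
joint /= joint.sum()
cpt = condition_on(joint, {1: 1, 2: 2})
print(cpt.shape)  # (2, 1, 1)
```

Because every conditioned axis is kept with length 1, the original axis order is preserved and no variable-to-dimension bookkeeping needs to change.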