Optimizing Python code with many attribute and dictionary lookups
I have written a program in Python which spends a large amount of time looking up attributes of objects and values from dictionary keys. I would like to know if there's any way I can optimize these lookup times, potentially with a C extension, to reduce the time of execution, or if I need to simply re-implement the program in a compiled language.
The program implements some algorithms using a graph. It runs prohibitively slowly on our data sets, so I profiled the code with cProfile
using a reduced data set that could actually complete. The vast majority of the time is being burned in one function, and specifically in two statements, generator expressions, within the function:
The generator expression at line 202 is

neighbors_in_selected_nodes = (neighbor for neighbor in
        node_neighbors if neighbor in selected_nodes)

and the generator expression at line 204 is

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in neighbors_in_selected_nodes)
The source code for the function providing this context is given below. selected_nodes is a set of nodes in the interaction_graph, which is a NetworkX Graph instance. node_neighbors is an iterator from Graph.neighbors_iter().

Graph itself uses dictionaries for storing nodes and edges. Its Graph.node attribute is a dictionary which stores nodes and their attributes (e.g., 'weight') in dictionaries belonging to each node.

Each of these lookups should be amortized constant time (i.e., O(1)); however, I am still paying a large penalty for them. Is there some way I can speed up these lookups (e.g., by writing parts of this as a C extension), or do I need to move the program to a compiled language?
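To make the cost concrete, each evaluation of interaction_graph.node[neighbor]['weight'] performs one attribute lookup and two dictionary lookups per neighbor. A minimal sketch of that decomposition (weight_of is a hypothetical helper, not part of the program):

def weight_of(interaction_graph, neighbor):
    node_attrs = interaction_graph.node    # attribute lookup on the graph
    neighbor_attrs = node_attrs[neighbor]  # first dict lookup
    return neighbor_attrs['weight']        # second dict lookup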
Below is the full source code for the function that provides the context; the vast majority of execution time is spent within this function.
def calculate_node_z_prime(
        node,
        interaction_graph,
        selected_nodes
        ):
    """Calculates a z'-score for a given node.

    The z'-score is based on the z-scores (weights) of the neighbors of
    the given node, and proportional to the z-score (weight) of the
    given node. Specifically, we find the maximum z-score of all
    neighbors of the given node that are also members of the given set
    of selected nodes, multiply this z-score by the z-score of the given
    node, and return this value as the z'-score for the given node.

    If the given node has no neighbors in the interaction graph, the
    z'-score is defined as zero.

    Returns the z'-score as zero or a positive floating point value.

    :Parameters:
    - `node`: the node for which to compute the z-prime score
    - `interaction_graph`: graph containing the gene-gene or gene
      product-gene product interactions
    - `selected_nodes`: a `set` of nodes fitting some criterion of
      interest (e.g., annotated with a term of interest)

    """
    node_neighbors = interaction_graph.neighbors_iter(node)
    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)
    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)
    try:
        max_z_score = max(neighbor_z_scores)
    # max() throws a ValueError if its argument has no elements; in this
    # case, we need to set the max_z_score to zero
    except ValueError, e:
        # Check to make certain max() raised this error
        if 'max()' in e.args[0]:
            max_z_score = 0
        else:
            raise e
    z_prime = interaction_graph.node[node]['weight'] * max_z_score
    return z_prime
Here are the top few calls according to cProfile, sorted by time.
ncalls tottime percall cumtime percall filename:lineno(function)
156067701 352.313 0.000 642.072 0.000 bpln_contextual.py:204(<genexpr>)
156067701 289.759 0.000 289.759 0.000 bpln_contextual.py:202(<genexpr>)
13963893 174.047 0.000 816.119 0.000 {max}
13963885 69.804 0.000 936.754 0.000 bpln_contextual.py:171(calculate_node_z_prime)
7116883 61.982 0.000 61.982 0.000 {method 'update' of 'set' objects}
Comments (4)
How about keeping the iteration order of interaction_graph.neighbors_iter(node) sorted (or partially sorted, using the heapq module)? Since you're just trying to find the max value, you can iterate node_neighbors in descending order; the first node you encounter that is in selected_nodes must then have the maximum weight.

Second, how often does selected_nodes change? If it changes rarely, you can save a lot of iterations by keeping a precomputed list of the entries interaction_graph.node[x] for x in selected_nodes, instead of rebuilding that list every time.

EDIT: to reply to the comments

Not necessarily; you're looking too much at your textbook. Despite what your textbook says, you can sometimes break the O(n log n) barrier by exploiting structure in your data. If you keep your list of neighbors in a naturally sorted data structure in the first place (e.g., a heap or a binary tree), you don't need to re-sort at every iteration. Of course this is a space-time tradeoff, since you will need to store redundant lists of neighbors, and there is code complexity in ensuring that the lists are updated when the neighbors change.

Also, Python's list.sort(), which uses the timsort algorithm, is very fast on nearly sorted data (it can average O(n) in certain cases). It still doesn't break O(n log n); that much has been proven to be mathematically impossible.

You need to profile before dismissing a solution as unlikely to improve performance. When doing extreme optimizations, you may find that in certain very special edge cases an old, slow bubble sort can win over a glorified quicksort or mergesort.
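A minimal sketch of the early-exit idea, assuming a hypothetical cache neighbors_by_weight that maps each node to its neighbors pre-sorted by descending weight (keeping this cache in sync with the graph is the tradeoff mentioned above):

def max_selected_weight(node, interaction_graph, selected_nodes,
                        neighbors_by_weight):
    # Neighbors are assumed pre-sorted by descending 'weight', so the
    # first selected neighbor encountered carries the maximum weight.
    for neighbor in neighbors_by_weight[node]:
        if neighbor in selected_nodes:
            return interaction_graph.node[neighbor]['weight']
    return 0  # no neighbor is in selected_nodes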
I don't see why your "weight" lookups have to be in the form of ["weight"] (nodes are dictionaries?) instead of .weight (nodes are objects). If your nodes are objects and don't have a lot of fields, you can take advantage of the __slots__ directive to optimize their storage:
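(The answer's original snippet is not preserved; the following is a minimal sketch of the kind of class it describes, with illustrative field names taken from later in the answer.)

class Node(object):
    # __slots__ replaces the per-instance attribute dict with a fixed
    # layout, reducing memory use and making attribute access cheap.
    __slots__ = ('contentString', 'weight')

    def __init__(self, contentString, weight):
        self.contentString = contentString
        self.weight = weight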
EDIT: So I looked at the NetworkX link you provided, and there are several things that bother me. First, right at the top, the definition of "dictionary" is "FIXME".
Overall, it seems insistent on using dictionaries, rather than classes that can be subclassed, to store attributes. While attribute lookup on an object may be essentially a dictionary lookup, I don't see how working with an object can be worse. If anything, it could be better, since object attribute lookup is more likely to be optimized: Python provides the __slots__ optimization for exactly these cases, where you have an object with only a couple of fields and need optimized access to them. I frequently use __slots__ on classes that represent coordinates, for example. A tree node would seem, to me, another obvious use.

So that's why when I read:

I think, okay, no problem, but then immediately following is

I think, if a node needs attributes, why would they be stored separately? Why not just put them in the node? Is writing a class with contentString and weight members detrimental? Edges seem even crazier, since they're dictated to be tuples and not objects which you could subclass. So I'm rather lost as to the design decisions behind NetworkX.
If you're stuck with it, I'd recommend moving attributes from those dictionaries into the actual nodes, or if that's not an option, using integers for keys into your attribute dictionary instead of strings, so searches use a much faster comparison algorithm.
Finally, what if you combined your generators:
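The combined expression itself is not preserved, but folding the question's two generator expressions into one is straightforward and removes a layer of generator indirection:

neighbor_z_scores = (
    interaction_graph.node[neighbor]['weight']
    for neighbor in interaction_graph.neighbors_iter(node)
    if neighbor in selected_nodes
)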
Try just directly accessing the dict and catching KeyError; it might be faster depending on your hit/miss ratio:
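One way to read this suggestion, assuming a hypothetical dict selected_weights that maps each selected node to its weight (rebuilt whenever selected_nodes changes); a missing key signals a non-selected neighbor:

# selected_weights = dict((n, interaction_graph.node[n]['weight'])
#                         for n in selected_nodes)  # hypothetical cache
max_z_score = 0  # matches the question's zero default
for neighbor in node_neighbors:
    try:
        weight = selected_weights[neighbor]  # raises KeyError on a miss
    except KeyError:
        continue
    if weight > max_z_score:
        max_z_score = weight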
or with dict.get() and a default, plus a list comprehension:
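The same idea with get(), again assuming the hypothetical selected_weights dict; the 0 default for non-selected neighbors agrees with the question's zero default as long as weights are non-negative:

neighbor_z_scores = [selected_weights.get(neighbor, 0)
                     for neighbor in node_neighbors]
max_z_score = max(neighbor_z_scores) if neighbor_z_scores else 0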
Without looking deeply into your code, try adding a little speed with itertools. Add these at the module level:
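(The answer's module-level additions are not preserved; a plausible reconstruction uses the Python 2 lazy variants of filter and map:)

from itertools import ifilter, imap  # Python 2 lazy filter/map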
Change:
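(the question's first generator expression)

neighbors_in_selected_nodes = (neighbor for neighbor in
        node_neighbors if neighbor in selected_nodes)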
into:
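(a plausible itertools form, reconstructed; ifilter applies the set's __contains__ method directly, avoiding a Python-level generator frame)

neighbors_in_selected_nodes = ifilter(
        selected_nodes.__contains__, node_neighbors)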
and:
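(the question's second generator expression)

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in neighbors_in_selected_nodes)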
into:
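(again a plausible reconstruction; the lambda performs the same two dict lookups, so any gain comes from replacing the generator machinery with imap)

neighbor_z_scores = imap(
        lambda neighbor: interaction_graph.node[neighbor]['weight'],
        neighbors_in_selected_nodes)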
Do these help?