Optimizing Python code with many attribute and dictionary lookups

Posted 2024-08-27 22:28:14

I have written a program in Python which spends a large amount of time looking up attributes of objects and values from dictionary keys. I would like to know if there's any way I can optimize these lookup times, potentially with a C extension, to reduce the time of execution, or if I need to simply re-implement the program in a compiled language.

The program implements some algorithms using a graph. It runs prohibitively slowly on our data sets, so I profiled the code with cProfile using a reduced data set that could actually complete. The vast majority of the time is being burned in one function, specifically in two statements, both generator expressions, within that function:

The generator expression at line 202 is

    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)

and the generator expression at line 204 is

    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)

The source code for this function is provided below for context.

selected_nodes is a set of nodes in the interaction_graph, which is a NetworkX Graph instance. node_neighbors is an iterator from Graph.neighbors_iter().

Graph itself uses dictionaries for storing nodes and edges. Its Graph.node attribute is a dictionary which stores nodes and their attributes (e.g., 'weight') in dictionaries belonging to each node.
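
For concreteness, a minimal sketch of that layout (using the NetworkX 1.x API; the node names here are made up):

import networkx

interaction_graph = networkx.Graph()
# attributes can be attached as keyword arguments when adding a node
interaction_graph.add_node('geneA', weight=1.5)
interaction_graph.add_node('geneB', weight=0.7)
interaction_graph.add_edge('geneA', 'geneB')

interaction_graph.node['geneA']             # {'weight': 1.5}
interaction_graph.node['geneA']['weight']   # 1.5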

Each of these lookups should take amortized constant time (i.e., O(1)); however, I am still paying a large penalty for them. Is there some way I can speed up these lookups (e.g., by writing parts of this as a C extension), or do I need to move the program to a compiled language?


Below is the full source code for the function that provides the context; the vast majority of execution time is spent within this function.

def calculate_node_z_prime(
        node,
        interaction_graph,
        selected_nodes
    ):
    """Calculates a z'-score for a given node.

    The z'-score is based on the z-scores (weights) of the neighbors of
    the given node, and proportional to the z-score (weight) of the
    given node. Specifically, we find the maximum z-score of all
    neighbors of the given node that are also members of the given set
    of selected nodes, multiply this z-score by the z-score of the given
    node, and return this value as the z'-score for the given node.

    If the given node has no neighbors in the interaction graph, the
    z'-score is defined as zero.

    Returns the z'-score as zero or a positive floating point value.

    :Parameters:
    - `node`: the node for which to compute the z-prime score
    - `interaction_graph`: graph containing the gene-gene or gene
      product-gene product interactions
    - `selected_nodes`: a `set` of nodes fitting some criterion of
      interest (e.g., annotated with a term of interest)

    """
    node_neighbors = interaction_graph.neighbors_iter(node)
    neighbors_in_selected_nodes = (neighbor for neighbor in
            node_neighbors if neighbor in selected_nodes)
    neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
            neighbor in neighbors_in_selected_nodes)
    try:
        max_z_score = max(neighbor_z_scores)
    # max() throws a ValueError if its argument has no elements; in this
    # case, we need to set the max_z_score to zero
    except ValueError as e:
        # Check to make certain max() raised this error
        if 'max()' in e.args[0]:
            max_z_score = 0
        else:
            raise e

    z_prime = interaction_graph.node[node]['weight'] * max_z_score
    return z_prime

Here are the top few calls according to cProfile, sorted by time.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
156067701  352.313    0.000  642.072    0.000 bpln_contextual.py:204(<genexpr>)
156067701  289.759    0.000  289.759    0.000 bpln_contextual.py:202(<genexpr>)
 13963893  174.047    0.000  816.119    0.000 {max}
 13963885   69.804    0.000  936.754    0.000 bpln_contextual.py:171(calculate_node_z_prime)
  7116883   61.982    0.000   61.982    0.000 {method 'update' of 'set' objects}
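
For reference, a listing in this format can be produced with an invocation along these lines (the -s time flag sorts by internal time):

python -m cProfile -s time bpln_contextual.py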

Comments (4)

凉月流沐 2024-09-03 22:28:14


How about keeping the iteration order of interaction_graph.neighbors_iter(node) sorted by weight (or partially sorted using the heapq module)? Since you're just trying to find the max value, you can iterate node_neighbors in descending order; the first node that is in selected_nodes must then carry the maximum weight among them (see the sketch below).

Second, how often does selected_nodes change? If it changes rarely, you can save a lot of iteration by keeping a list of "interaction_graph.node[x] for x in selected_nodes" instead of rebuilding this list every time.
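
A minimal sketch of that first idea (illustrative only; in practice you would cache the sorted list per node rather than re-sorting on every call):

# Sort neighbors by weight, heaviest first; the first one that is also
# in selected_nodes then carries the maximum weight by construction.
node_attrs = interaction_graph.node
sorted_neighbors = sorted(interaction_graph.neighbors_iter(node),
        key=lambda n: node_attrs[n]['weight'], reverse=True)

max_z_score = 0
for neighbor in sorted_neighbors:
    if neighbor in selected_nodes:
        max_z_score = node_attrs[neighbor]['weight']
        break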

EDIT: to reply on the comments

"A sort() would take O(n log n)"

Not necessarily, you're looking too much at your textbook. Despite what your textbook says, you can sometimes break the O(n log n) barrier by exploiting certain structure in your data. If you keep your list of neighbors in a naturally sorted data structure in the first place (e.g. a heap via the heapq module, or a binary tree), you don't need to re-sort at every iteration. Of course this is a space-time tradeoff, since you will need to store redundant lists of neighbors, and there is code complexity to ensure that the list of neighbors is updated when the neighbors change.

Also, Python's list.sort(), which uses the timsort algorithm, is very fast for nearly sorted data (it can approach O(n) in favorable cases). It still doesn't break O(n log n) in the general case; that much has been proven to be mathematically impossible for comparison sorts.

You need to profile before dismissing a solution as unlikely to improve performance. When doing extreme optimization, you may find that in certain very special edge cases the old and slow bubble sort can win over a glorified quicksort or mergesort.

我还不会笑 2024-09-03 22:28:14


I don't see why your "weight" lookups have to be in the form of ["weight"] (nodes are dictionaries?) instead of .weight (nodes are objects).

If your nodes are objects, and don't have a lot of fields, you can take advantage of the __slots__ directive to optimize their storage:

class Node(object):
    # ... class stuff goes here ...

    __slots__ = ('weight',) # tuple of member names.

EDIT: So I looked at the NetworkX link you provided, and there are several things that bother me. First is that, right at the top, the definition of "dictionary" is "FIXME".

Overall, it seems insistent on using dictionaries, rather than using classes that can be subclassed, to store attributes. While attribute lookup on an object may be essentially a dictionary lookup, I don't see how working with an object can be worse. If anything, it could be better since an object attribute lookup is more likely to be optimized, because:

  • object attribute lookups are so common,
  • the keyspace for object attributes is far more restricted than for dictionary keys, thus an optimized comparison algorithm can be used in the search, and
  • objects have the __slots__ optimization for exactly these cases, where you have an object with only a couple fields and need optimized access to them.

I frequently use __slots__ on classes that represent coordinates, for example. A tree node would seem, to me, another obvious use.

So that's why when I read:

node
A node can be any hashable Python object except None.

I think, okay, no problem, but then immediately following is

node attribute
Nodes can have arbitrary Python objects assigned as attributes by using keyword/value pairs when adding a node or assigning to the G.node[n] attribute dictionary for the specified node n.

I think, if a node needs attributes, why would they be stored separately? Why not just put them in the node? Is writing a class with contentString and weight members detrimental? Edges seem even crazier, since they're dictated to be tuples and not objects which you could subclass.

So I'm rather lost as to the design decisions behind NetworkX.

If you're stuck with it, I'd recommend moving attributes from those dictionaries into the actual nodes, or if that's not an option, using integers for keys into your attribute dictionary instead of strings, so searches use a much faster comparison algorithm.
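
A tiny sketch of the integer-key idea; the WEIGHT constant is hypothetical, not something NetworkX provides:

# hypothetical module-level constant standing in for the 'weight' string key
WEIGHT = 0

interaction_graph.node[node][WEIGHT] = 1.5      # store under an int key
z_score = interaction_graph.node[node][WEIGHT]  # int keys compare faster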

Finally, what if you combined your generators:

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in node_neighbors if neighbor in selected_nodes)
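
Folding the combined generator back into the function would look something like this sketch (same behavior as the original, with the attribute dict also cached in a local):

def calculate_node_z_prime(node, interaction_graph, selected_nodes):
    # one generator instead of two, one dict lookup chain per neighbor
    node_attrs = interaction_graph.node
    neighbor_z_scores = (node_attrs[neighbor]['weight']
            for neighbor in interaction_graph.neighbors_iter(node)
            if neighbor in selected_nodes)
    try:
        max_z_score = max(neighbor_z_scores)
    # max() raises ValueError when the generator is empty
    except ValueError:
        max_z_score = 0
    return node_attrs[node]['weight'] * max_z_score
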
尐籹人 2024-09-03 22:28:14


Try directly accessing the dict and catching KeyError; it might be faster depending on your hit/miss ratio:

# cache this object
ignode = interaction_graph.node
neighbor_z_scores = []
for neighbor in node_neighbors:
    try:
        neighbor_z_scores.append(ignode[neighbor]['weight'])
    except KeyError:
        pass

or with dict.get and a sentinel, using generator expressions:

sentinel = object()
# cache this object
ignode = interaction_graph.node

# dict.get returns the sentinel instead of raising KeyError on a miss
node_attrs = (ignode.get(neighbor, sentinel) for neighbor in node_neighbors)
# using identity testing against the sentinel, it's slightly faster
neighbor_z_scores = (attrs['weight'] for attrs in node_attrs
        if attrs is not sentinel)
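
On a related note, if Python 3.4 or later is an option, the try/except ValueError around max() in the original function can be dropped entirely, since max() accepts a default there (not available in the Python 2 code above):

max_z_score = max(neighbor_z_scores, default=0)
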
时光沙漏 2024-09-03 22:28:14


Without looking deeply into your code, try adding a little speed with itertools.

Add these at the module level:

import itertools as it, operator as op
GET_WEIGHT = op.itemgetter('weight')  # itemgetter, not attrgetter: the per-node attribute containers are plain dicts

Change:

neighbors_in_selected_nodes = (neighbor for neighbor in
        node_neighbors if neighbor in selected_nodes)

into:

neighbors_in_selected_nodes = it.ifilter(selected_nodes.__contains__, node_neighbors)

and:

neighbor_z_scores = (interaction_graph.node[neighbor]['weight'] for
        neighbor in neighbors_in_selected_nodes)

into:

neighbor_z_scores = (
    it.imap(
        GET_WEIGHT,
        it.imap(
            interaction_graph.node.__getitem__,
            neighbors_in_selected_nodes)
    )
)

Do these help?
