Python的一分子功能如何使用此元组作为参数？

发布于 2025-02-01 04:53:21 字数 2117 浏览 2 评论 0 原文

这是我关注的有关基因组学课程引起的问题。我们目前正在处理搜索方法，特别是如何在更长的DNA序列（ t ，还有一个字符串）。所讨论的方法涉及创建 hash表 t 由长度 k 及其位置组成的所有可能的子字符串（每个子字符串都是一个所谓的 kmer ）。然后，我们搜索带有 p 的相同长度（kmers）的子字符串的哈希表。这涉及创建索引类，以及最终使用Python的 bisect_left 函数。这是代码：

import bisect
import sys

class Index(object):

    def __init__(self, t, k):
        ''' Create index from all substrings of size 'length' '''
        self.k = k  # k-mer length (k)
        self.index = []
        for i in range(len(t) - k + 1):  # for each k-mer
            self.index.append((t[i:i+k], i))  # append (k-mer, offset) pair (tuple)
        self.index.sort()  # alphabetize by k-mer
    
    def query(self, p):
        ''' Return index hits for first k-mer of P '''
        kmer = p[:self.k]  # query with first k-mer
        i = bisect.bisect_left(self.index, (kmer, -1))  # binary search
        hits = []
        while i < len(self.index):  # collect matching index entries
            if self.index[i][0] != kmer:
                break
            hits.append(self.index[i][1])
            i += 1
        return hits

我的问题涉及 bisect_left函数的格式。 Python文档给出了此函数的语法，如下所示：
bisect.bisect_left（a，x，lo = 0，hi = len（a）， *，key = none）

我的课程中使用的BISECT_LEFT功能以这种方式格式化：
i = bisect.bisect_left（self.index，（kmer，-1））
我知道参数 a 已通过参数 self.index ，这是由 t 创建的哈希表。但是后来我被随后的元组感到困惑。我看不出以下任何参数如何接受元组。我可以想象，哈希表被组织为形式的元素（kmer，位置），但是这些条目都没有位置 -1 。讲师指出，他将值设置为-1的特定目的，如下所示：

“因此，第一步就是找到P的第一个K基础来在桌子中查找它们。因此，我要说kmer等于P长度为k，我们需要做自我。 Bisect 迅速地说i = bisect.bisect_left。 strong>元组，所以我将通过 kmer ，然后我将 -1 和放置。列表中的索引大于负数，这确保我们始终获得第一次出现。”

我的意思是，我了解英语，但我看不出一分位数（a）将元组作为参数，并且（b）假设它可以（显然可以），它如何使用位置 - 1 ，在哈希表中不存在？我有一个直觉，我没有意识到分类的情况，但我不能把手放在上面。

非常感谢，抱歉，我的问题一定是太长的声明。

原文

This is a question arising from a Coursera course on genomics that I'm following. We are currently dealing with search methods, specifically how to detect (align) all occurrences of a shorter DNA sequence (p, a string) within a longer DNA sequence (t, also a string). The approach in question involves the creation of a hash table for t composed of all possible substrings of length k AND their positions (each substring is a so-called kmer). We then search the hash table with substrings of the same length (kmers) derived from p. This involves the creation of an Index class and the eventual use of Python's bisect_left function. Here's the code:

import bisect
import sys

class Index(object):

    def __init__(self, t, k):
        ''' Create index from all substrings of size 'length' '''
        self.k = k  # k-mer length (k)
        self.index = []
        for i in range(len(t) - k + 1):  # for each k-mer
            self.index.append((t[i:i+k], i))  # append (k-mer, offset) pair (tuple)
        self.index.sort()  # alphabetize by k-mer
    
    def query(self, p):
        ''' Return index hits for first k-mer of P '''
        kmer = p[:self.k]  # query with first k-mer
        i = bisect.bisect_left(self.index, (kmer, -1))  # binary search
        hits = []
        while i < len(self.index):  # collect matching index entries
            if self.index[i][0] != kmer:
                break
            hits.append(self.index[i][1])
            i += 1
        return hits

My question concerns the format of the bisect_left function. Python docs gives the syntax for this function as follows:
bisect.bisect_left(a, x, lo=0, hi=len(a), *, key=None)
The bisect_left function used in my course is formatted in this way:
i = bisect.bisect_left(self.index, (kmer, -1))
I understand that the parameter a has been passed the argument self.index, which is the hash table created from t. But then I'm flummoxed by the tuple which follows. I don't see how any of the following parameters accepts a tuple. I can imagine that the hash table is organized as tuples of the form (kmer, position), but none of these entries will have the position -1. The instructor states that he sets the value to -1 for a specific purpose as follows:

"So the first step is to just find the first k basis of p to look them up in the tables. So, I'm going to say kmer equals p up to length k, and we need to do self.k. And now I want to find the first position in the list where this kmer occurs, so I'm going to use bisect to do that quickly. And say i = bisect.bisect_left. And I pass in the list that we're searching for, and then the object that we're searching for. And remember this is a list of tuples so I'm going to pass in the kmer, and then I'm going to put -1. And since all the indices in the list are greater than negative one, this assures that we always get the first occurrence of that."

I mean, I understand the English, but I don't see how bisect can (a) take in a tuple as an argument, and (b) assuming it can (obviously it can), how can it make use of a position, -1, that doesn't exist in the hash table? I have a hunch I'm not cluing in to what's going on in terms of sorting, but I can't put my finger on it.

Many thanks, and sorry for what must be the overly long statement of my question.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

骷髅 2025-02-08 04:53:21

在Python中，您可以在词典上比较元素（如词典中）。首先，比较第一个元素，然后是第二个元素，依此类推。

>>> (1, 2, 3) < (1, 2, 4)
True
>>> (1, 3, 3) < (1, 2, 4)
False
>>> ('a', 'p', 'p', 'l', 'e') > ('a', 'r', 't')
False
>>>

bisect.bisect_left 不限于整数或字符串，它可以与可以比较的任何对象一起使用。实际上，您可以在此处查看它的实现：

作为

例如，此列表进行排序：

[("a", 1), ("b", -42), ("c", 2), ("c", 3)]

这不是：

[("a", 1), ("c", 3),  ("b", -42), ("c", 3)]

使用 bisect.bisect_left 与元组进行示例：

from bisect import bisect_left

# The shopping list must be ordered by food type first, by
# quantity second.
shopping_list = [("apples", 100), ("bananas", 42), ("grapes", 250)]

# Say, we want to figure out where to put our cherries.
print(bisect_left(shopping_list, ("cherries", 666)))  # 2
# Our new list might look like this:
[("apples", 100), ("bananas", 42), ("cherries", 666), ("grapes", 250)]
# (just for illustration, bisect_left doesn't change the list)

# What if we want to find out where the bananas start?
print(bisect_left(shopping_list, ("bananas",)))  # 1

感谢Kelly Bundy的评论：（KMER，）是一个更好的值使用（Kmer，-1）。（“ foo”，）确实小于（“ foo”，whyther）。如果您的要求更改并且现在 -1 是有效的第二个元素，则更好。

ps self.index 列表不是哈希表。这只是一个分类的元组列表。它没有任何特殊的搜索功能（除了允许二进制搜索，这就是您正在做的事情）。有 Brandon Rhodes 和这 SO问题用于说明哈希表。

In Python, you can compare tuples lexicographically (like in a dictionary). First, the first elements are compared, then the second elements, and so on.

>>> (1, 2, 3) < (1, 2, 4)
True
>>> (1, 3, 3) < (1, 2, 4)
False
>>> ('a', 'p', 'p', 'l', 'e') > ('a', 'r', 't')
False
>>>

bisect.bisect_left is not restricted to integers or strings, and it works with any objects that can be compared. In fact, you can see how it's implemented here: GitHub link

As the documentation on bisect_left says, it searches for the index at which to insert the new tuple so that the list is still sorted.

For instance, this list is sorted:

[("a", 1), ("b", -42), ("c", 2), ("c", 3)]

And this is not:

[("a", 1), ("c", 3),  ("b", -42), ("c", 3)]

Example of using bisect.bisect_left with tuples:

from bisect import bisect_left

# The shopping list must be ordered by food type first, by
# quantity second.
shopping_list = [("apples", 100), ("bananas", 42), ("grapes", 250)]

# Say, we want to figure out where to put our cherries.
print(bisect_left(shopping_list, ("cherries", 666)))  # 2
# Our new list might look like this:
[("apples", 100), ("bananas", 42), ("cherries", 666), ("grapes", 250)]
# (just for illustration, bisect_left doesn't change the list)

# What if we want to find out where the bananas start?
print(bisect_left(shopping_list, ("bananas",)))  # 1

Thanks to Kelly Bundy's comment: (kmer,) is a better value to use than (kmer, -1). ("foo",) is indeed less than ("foo", whatever). This is better if your requirements change and now -1 is a valid second element.

P.S. The self.index list is not a hash table. It's simply a sorted list of tuples. It doesn't have any special searching capabilities (besides allowing binary search, which is what you're doing). There is a great talk by Brandon Rhodes and this SO question for an explanation of hash tables.