DNA sequence alignment in native Python (no biopython)

Posted on 2024-08-24 06:04:55

I have an interesting genetics problem that I would like to solve in native Python (nothing outside the standard library). The point is to make the solution very easy to use on any computer, without requiring the user to install additional modules.

Here it is. I received 100,000s of DNA sequences (up to 2 billion) from a 454 next-generation sequencing run. I want to trim the extremities in order to remove primers that may be present on both ends, in both normal and reverse sense. Example:

seq001: ACTGACGGATAGCTGACCTGATGATGGGTTGACCAGTGATC
        --primer-1---                 --primer-2-

Primers can be present one or multiple times (one right after the other). Normal-sense primers are always on the left, and reverse-sense primers on the right. My goal is thus to find the primers and cut the sequence so that only the primer-free part remains. For this, I want to use a classic alignment algorithm (e.g. Smith-Waterman) that has been implemented in native Python (i.e. not through biopython). I am aware that this may require quite some time (up to hours).

Note: This is NOT a direct "word" search, as the DNA, both in the sequences and in the primers, can be "mutated" for various technical reasons.
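For reference, here is a minimal sketch of what a pure-Python Smith-Waterman local alignment could look like. The function name `smith_waterman`, the scoring values (match=2, mismatch=-1, gap=-2) and the two primer strings are assumptions made for illustration; the primers are simply read off the marker positions in the example above.

```python
def smith_waterman(primer, read, match=2, mismatch=-1, gap=-2):
    """Best local alignment of primer against read.

    Returns (score, start, end), where read[start:end] is the aligned
    stretch of the read. Scoring values are illustrative assumptions.
    """
    cols = len(read) + 1
    prev = [0] * cols                 # scores of the previous DP row
    prev_start = list(range(cols))    # read index where each alignment began
    best = (0, 0, 0)
    for i in range(1, len(primer) + 1):
        cur = [0] * cols
        cur_start = list(range(cols))
        for j in range(1, cols):
            same = primer[i - 1] == read[j - 1]
            diag = prev[j - 1] + (match if same else mismatch)
            up = prev[j] + gap        # primer base with no partner in the read
            left = cur[j - 1] + gap   # extra read base inside the primer
            score = max(0, diag, up, left)
            cur[j] = score
            if score == 0:
                cur_start[j] = j      # alignment restarts after this cell
            elif score == diag:
                cur_start[j] = prev_start[j - 1]
            elif score == up:
                cur_start[j] = prev_start[j]
            else:
                cur_start[j] = cur_start[j - 1]
            if score > best[0]:
                best = (score, cur_start[j], j)
        prev, prev_start = cur, cur_start
    return best


# Hypothetical primers taken from the marker positions in the example.
read = "ACTGACGGATAGCTGACCTGATGATGGGTTGACCAGTGATC"
_, _, left_end = smith_waterman("ACTGACGGATAGC", read)
_, right_start, _ = smith_waterman("GACCAGTGATC", read)
insert = read[left_end:right_start]   # the primer-free part
```

Each call costs time proportional to len(primer) * len(read), so aligning each primer only against the first or last few dozen bases of a read, rather than the whole read, should keep the run time manageable.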

What would you use?

天涯离梦残月幽梦 2024-08-31 06:04:55

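From a quick study of the algorithm, this is not easy stuff. It will take some seriously heavy algorithmic work. Try adjusting your expectations from "a few hours" to "a few days or weeks".

The programmer implementing this will need:

To be highly competent in general Python programming.
Experience in algorithmic programming, with a good understanding of time complexity.
A good understanding of Python data structures such as dict, set, and deque, and their complexity characteristics.
Familiarity with unit tests.

That programmer may or may not be you right now. It sounds like an awesome project, good luck!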

没有心的人 2024-08-31 06:04:55

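Here is a paper on the subject:

Rocke, On finding novel gapped motifs in DNA sequences, 1998.

Hopefully, from that paper and its references, as well as from other papers that cite it, you can find many ideas for algorithms. You won't find Python code, but you may find descriptions of algorithms that you can implement in Python.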

倚栏听风 2024-08-31 06:04:55

You could do this quite simply using regex. I don't think it would be that complicated! In fact, I have just completed some code that does pretty much the same thing for one of the guys at the university here.

If you are not looking for exact copies of the primers, because of mutation, then an element of fuzzy matching could be applied. The version I wrote simply looked for exact primer matches at the start and end and returned the sequence minus those primers, using the following code:

import re
# start_primer and end_primer are the primer sequences you are looking to match
pattern = "^" + start_primer + "([A-Z]+)" + end_primer + "$"
match = re.match(pattern, sequence)  # sequence is the DNA sequence you are analyzing
if match:  # re.match returns None when the primers are not found
    print(match.group(1))  # prints the sequence between the start and end primers
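Since the question mentions that primers may appear several times in a row, the same exact-match idea could be extended along these lines. This is only a sketch: the helper name `trim_exact` is hypothetical, and it still assumes the primers are unmutated.

```python
import re


def trim_exact(sequence, start_primer, end_primer):
    """Strip any number of exact primer copies from each end.

    Returns the remaining sequence, or None if the sequence contains
    characters other than A-Z.
    """
    left = re.escape(start_primer)
    right = re.escape(end_primer)
    # Greedy primer copies on the left, shortest possible middle part,
    # then as many primer copies as fit on the right.
    pattern = "^(?:%s)*([A-Z]*?)(?:%s)*$" % (left, right)
    match = re.match(pattern, sequence)
    return match.group(1) if match else None
```

A sequence without any primer comes back unchanged, because both primer groups are allowed to match zero times.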

Here's a link on fuzzy regex matching in Python: http://hackerboss.com/approximate-regex-matching-in-python/
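Because the question rules out anything beyond the standard library, the fuzzy part could also be approximated with a small edit-distance routine. The sketch below is one possible approach, not the method from the link: the name `fuzzy_prefix_end` and the default of two allowed errors are assumptions. It looks for the primer anchored at the very start of the read and tolerates substitutions, insertions and deletions.

```python
def fuzzy_prefix_end(primer, read, max_errors=2):
    """Find where an approximate copy of primer ends at the start of read.

    Returns the index at which to cut the read, or None when more than
    max_errors edits (substitutions, insertions, deletions) are needed.
    """
    # A match within max_errors edits cannot be longer than this window.
    window = read[:len(primer) + max_errors]
    # prev[j] holds the edit distance between primer[:i] and window[:j].
    prev = list(range(len(window) + 1))
    for i, p in enumerate(primer, 1):
        cur = [i]
        for j, r in enumerate(window, 1):
            cur.append(min(prev[j] + 1,               # primer base missing from the read
                           cur[j - 1] + 1,            # extra base in the read
                           prev[j - 1] + (p != r)))   # match or substitution
        prev = cur
    # Lowest distance wins; ties go to the shortest match (conservative trim).
    end = min(range(len(prev)), key=prev.__getitem__)
    return end if prev[end] <= max_errors else None
```

The right-hand primer can be handled with the same function by reversing both strings, for example `fuzzy_prefix_end(end_primer[::-1], read[::-1])`, which gives the number of bases to drop from the right end. Calling it repeatedly on the trimmed read until it returns None would deal with primers that occur several times in a row.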
