Comparing large text files line by line in Python

Posted 2024-10-27 08:29:35

I am working on some large (several million line) bioinformatics data sets with the general format:

chromosomeNumber locusStart locusStop sequence moreData

I have other files in this format:

chromosomeNumber locusStart locusStop moreData

What I need to be able to do is read one file of each type into memory, and if the locusStart of a line in the upper file falls between the start and stop of any line in the lower file, print that line to output file 1. If its locusStart does not fall between the start and stop of any line in the lower file, print it to output file 2.

I am currently reading the files in and converting them into dictionaries keyed on chromosome, with the corresponding lines as values. I then split each value line into fields and compare the fields as strings. This takes an incredibly long time, and I would like to know if there is a more efficient way to do it.

Thanks.
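For concreteness, the brute-force approach described in the question looks roughly like this (a minimal sketch; the file names are hypothetical, and the columns are assumed whitespace-delimited, in the order shown above):

from collections import defaultdict

def load_by_chromosome(path):
    """Read a file into a dict keyed on chromosome (the first column)."""
    by_chrom = defaultdict(list)
    with open(path, 'r') as f:
        for line in f:
            fields = line.strip().split()
            by_chrom[fields[0]].append(fields)
    return by_chrom

upper = load_by_chromosome('upper_file.txt')   # hypothetical file names
lower = load_by_chromosome('lower_file.txt')

with open('output1.txt', 'w') as out1, open('output2.txt', 'w') as out2:
    for chrom, lines in upper.items():
        for fields in lines:
            start = int(fields[1])
            # Scanning every lower-file line for every upper-file line
            # is what makes this approach so slow: O(n*m) comparisons.
            hit = any(int(lo[1]) <= start <= int(lo[2])
                      for lo in lower.get(chrom, []))
            (out1 if hit else out2).write('\t'.join(fields) + '\n')

The answers below focus on replacing that inner scan with something cheaper.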

Comments (3)

起风了 2024-11-03 08:29:36

It sounds like you are going to be doing lots of greater-than/less-than comparisons, so I don't think loading your data into dictionaries is going to enhance the speed of the code at all -- based on what you've explained, it sounds like you're still looping through every element in one file or the other.

What you need is a different data structure to load your data into and run comparison operations with. Check out the Python bisect module; I think it may provide the data structure that you need to run your comparison operations much more efficiently.


If you can more precisely describe what exactly you're trying to accomplish, we'll be able to help you get started writing your code.
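As a rough illustration of the idea (a sketch with made-up data; it assumes the (start, stop) intervals for a chromosome are sorted by start and non-overlapping):

import bisect

# Made-up (start, stop) intervals from the lower file for one chromosome,
# assumed sorted by start and non-overlapping for this sketch.
intervals = [(100, 200), (350, 400), (900, 1200)]
starts = [s for s, _ in intervals]

def covered(locus_start):
    """True if locus_start falls inside one of the intervals."""
    # Rightmost interval whose start is <= locus_start...
    i = bisect.bisect_right(starts, locus_start) - 1
    # ...then check locus_start against that interval's stop.
    return i >= 0 and locus_start <= intervals[i][1]

print(covered(150))   # True: inside (100, 200)
print(covered(250))   # False: falls between intervals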

疯到世界奔溃 2024-11-03 08:29:36

Using a dictionary of the chromosome number is a good idea, as long as you can fit both files into memory.

You then want to sort both lists by locusStart (split the string, convert locusStart to a number--see instructions on sorting if you're unsure how to sort on locusStart alone).

Now you can just walk through your lists: if the lower locusStart is less than the first upper locusStart, put the line in file 2 and go on to the next one. If the lower locusStart is greater than the first upper locusStart then

  • While it is also greater than locusEnd, throw away the beginning of the upper list
  • If you find a case where it's greater than locusStart and less than locusEnd, put it in file 1
  • Otherwise, put it in file 2

This should replace what is now probably an O(n^2) algorithm with an O(n log n) one; a sketch of the walk is below.
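A minimal sketch of that walk for a single chromosome (hypothetical names; following the question's orientation, upper-file lines are classified against lower-file intervals, with lines already split into whitespace-delimited fields):

def classify(upper_lines, lower_lines):
    """Yield (fields, matched) pairs; matched=True means output file 1."""
    # Sort both lists by locusStart (the second column), as described above.
    upper_lines.sort(key=lambda f: int(f[1]))
    lower_lines.sort(key=lambda f: int(f[1]))
    i = 0
    for fields in upper_lines:
        start = int(fields[1])
        # Throw away lower intervals that end before this locusStart;
        # later upper lines have even larger starts, so those intervals
        # can never match again.
        while i < len(lower_lines) and int(lower_lines[i][2]) < start:
            i += 1
        # The surviving interval ends at or after start, so containment
        # reduces to checking whether its locusStart is <= start.
        matched = i < len(lower_lines) and int(lower_lines[i][1]) <= start
        yield fields, matched

Each line is visited once after the two sorts, which is where the O(n log n) bound comes from.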

空心↖ 2024-11-03 08:29:35

It seems that for the lower file (which I am assuming has the second format), the only field you are concerned with is locusStart. Since, from your description, you do not necessarily care about the other data, you could make a set of all of the locusStart values:

locusStart_set = set()
with open(lower_file, 'r') as f:   # the file in the second format
    for line in f:
        fields = line.strip().split()
        # locusStart is the second column; store it as an int so the
        # comparison doesn't depend on string formatting.
        locusStart_set.add(int(fields[1]))

This removes all of the unnecessary line manipulation you do for the bottom file. Then, you can easily compare the locusStart of a field to the set built from the lower file. The set would also remove duplicates, making it a bit faster than using a list.
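A sketch of how the upper file could then be streamed against that set (hypothetical file names; note that, like the answer above, this tests exact equality of locusStart rather than start/stop containment):

with open(upper_file, 'r') as f, \
        open('output1.txt', 'w') as out1, \
        open('output2.txt', 'w') as out2:
    for line in f:
        fields = line.strip().split()
        # Set membership is a single O(1) hash lookup per line.
        if int(fields[1]) in locusStart_set:
            out1.write(line)
        else:
            out2.write(line)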
