Advice for dealing with large files - python or command line?
Given two files, one containing entries of the form:
label1 label2 name1
label1 label3 name2
and the other of the form:
label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5
Suppose you want to extract those lines from file two for which the first three elements appear in a line (order important) in file one - any suggestions on how this might be done quickly?
The output file from any such script given the above sample data would be:
label1 label2 name1 0.1 1000
I toyed with python:
inp = open('file1.txt', 'r')
look_up = [i.split() for i in inp.readlines()]
inp.close()

inp = open('file2.txt', 'r')
holder = []
line = inp.readline()
while line:
    line = line.split()
    if [line[0], line[1], line[2]] in look_up:
        holder.append(line)
    line = inp.readline()
inp.close()
However this seems to take a while. These files are rather large.
Thanks!
Comments (4)
Your python version is rather inefficient because you're testing for membership in a list, rather than a set or a dict (i.e. O(n) lookup time instead of O(1)).
Try using a set of tuples or a set of strings instead. Tuples would be a better choice, as the two files could be split on different delimiters, but I don't think you'll see a particularly large performance difference. tuple('something'.split()) is relatively fast compared to testing for membership in a very long list.

Also, there's no need to call inp.readlines(); you can iterate over the file directly. You should see a significant speedup without having to change any other part of your code, other than using tuple(line[:3]) rather than [line[0], line[1], line[2]].
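The code for this change was lost from this copy of the answer; a minimal sketch of what it describes (demonstrated on the sample data from the question, written out first so the snippet runs standalone) would be:

```python
# Sample data from the question, written to the hypothetical file names
# file1.txt / file2.txt so the sketch is self-contained.
with open('file1.txt', 'w') as f:
    f.write('label1 label2 name1\nlabel1 label3 name2\n')
with open('file2.txt', 'w') as f:
    f.write('label1 label2 name1 0.1 1000\nlabel9 label6 name7 0.8 0.5\n')

# Build a set of tuples from file 1: membership tests become O(1).
with open('file1.txt') as f:
    look_up = set(tuple(line.split()) for line in f)

# Iterate over file 2 directly (no readlines()) and test tuple(fields[:3]).
holder = []
with open('file2.txt') as f:
    for line in f:
        fields = line.split()
        if tuple(fields[:3]) in look_up:
            holder.append(fields)
```

The only structural changes from the question's version are the set of tuples and iterating over the file object itself.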
Actually, grep and bash are pretty perfect for this... (Untested, but it should work.)
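The grep command itself is missing from this copy of the answer; a plausible reconstruction (my assumption, not the original one-liner) uses -F/-f to treat each line of file1.txt as a fixed-string pattern:

```shell
# Sample data from the question, written out so the example is self-contained.
printf 'label1 label2 name1\nlabel1 label3 name2\n' > file1.txt
printf 'label1 label2 name1 0.1 1000\nlabel9 label6 name7 0.8 0.5\n' > file2.txt

# -F: patterns are fixed strings; -f: read one pattern per line from file1.txt.
# Caveat: this matches the pattern anywhere in the line, not only as the
# first three fields, so it is slightly looser than the python version.
grep -Ff file1.txt file2.txt > output.txt
cat output.txt
```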
To see which one is faster, we can generate some test data (~4500 keys in file1.txt and 1,000,000 lines in file2.txt) and benchmark a simple python version of the same thing (roughly... the lines will be printed in a different order than the grep version). The python version turns out to be ~70x faster.
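The generation script didn't survive in this copy either; something along these lines would produce data of the sizes mentioned (the key format and the overlap fraction are made up for illustration):

```python
import random

# ~4500 distinct keys in file1.txt
with open('file1.txt', 'w') as f:
    for i in range(4500):
        f.write('label%d label%d name%d\n' % (i, i + 1, i))

# 1,000,000 lines in file2.txt; keys drawn from a wider range so only
# some of them also appear in file1.txt.
with open('file2.txt', 'w') as f:
    for _ in range(1000000):
        i = random.randrange(9000)
        f.write('label%d label%d name%d %.3f %.3f\n'
                % (i, i + 1, i, random.random(), random.random()))
```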
Of course, the two examples are approaching the problem in entirely different ways. We're really comparing two algorithms, not two implementations. For example, if we only have a couple of key lines in file1, the bash/grep solution easily wins.

(Does bash have a builtin container of some sort with O(1) lookup for membership? I think bash 4 might have a hash table, but I don't know anything about it... It would be interesting to try implementing a similar algorithm to the python example above in bash, as well...)
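For what it's worth, bash 4 does have associative arrays (declare -A), so the set-lookup algorithm can be sketched in pure bash. This is my own sketch, not from the answer, and it will be slow in practice since the read loop is interpreted line by line:

```shell
#!/usr/bin/env bash
# Requires bash >= 4 for associative arrays.
printf 'label1 label2 name1\nlabel1 label3 name2\n' > file1.txt
printf 'label1 label2 name1 0.1 1000\nlabel9 label6 name7 0.8 0.5\n' > file2.txt

# Use each whole line of file 1 as a key in an associative array.
declare -A seen
while IFS= read -r line; do
    seen["$line"]=1
done < file1.txt

# For each file-2 line, rebuild the three-field key and test membership.
while read -r a b c rest; do
    key="$a $b $c"
    [[ -n "${seen[$key]}" ]] && echo "$a $b $c $rest"
done < file2.txt > output.txt
cat output.txt
```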
Hacky bash/sort/Perl solution:
It merges both files into one list, sorts it (so we get chunks of data with the same key, led by the line from file 1), then uses a small Perl oneliner to keep only the well-formed lines that have a preceding "header" from file 1.
You can try using the string "label1 label2 name1" as a key, rather than a triplet of values.
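A minimal sketch of this string-key variant (same hypothetical file names as the question, sample data written out first):

```python
# Sample data from the question.
with open('file1.txt', 'w') as f:
    f.write('label1 label2 name1\nlabel1 label3 name2\n')
with open('file2.txt', 'w') as f:
    f.write('label1 label2 name1 0.1 1000\nlabel9 label6 name7 0.8 0.5\n')

# Normalize each file-1 line into a single string key.
with open('file1.txt') as f:
    keys = set(' '.join(line.split()) for line in f)

# Re-join the first three fields of each file-2 line into the same form.
matches = []
with open('file2.txt') as f:
    for line in f:
        if ' '.join(line.split()[:3]) in keys:
            matches.append(line.rstrip('\n'))
```

Joining through split() on both sides makes the comparison insensitive to the exact whitespace used in either file.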
I'd use a hash to store the values from the first file. It's not that error-resilient (it assumes one and only one space between items), but you'll get the general idea...
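The code for this answer is missing from this copy; a sketch of the idea in Python (the original was likely Perl) keeps the same limitation: the raw text up to the third space is used as the hash key, so it only works with exactly one space between items.

```python
# Sample data from the question.
with open('file1.txt', 'w') as f:
    f.write('label1 label2 name1\nlabel1 label3 name2\n')
with open('file2.txt', 'w') as f:
    f.write('label1 label2 name1 0.1 1000\nlabel9 label6 name7 0.8 0.5\n')

# Hash (set) keyed on the raw line text of file 1.
with open('file1.txt') as f:
    table = {line.rstrip('\n') for line in f}

# The file-2 key is the literal text before the third space -- this is
# where the "1 and only 1 space" assumption bites.
kept = []
with open('file2.txt') as f:
    for line in f:
        key = ' '.join(line.split(' ')[:3])
        if key in table:
            kept.append(line.rstrip('\n'))
```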