Advice on handling large files - python or command line?

Posted 2024-12-03 09:29:05


Given two files, one containing entries of the form:

label1 label2 name1
label1 label3 name2

and the other of the form:

label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

Suppose you want to extract those lines from file two whose first three elements appear, in that order, as a line in file one - any suggestions on how this might be done quickly?

The output file from any such script given the above sample data would be:

label1 label2 name1 0.1 1000

I toyed with python:

inp = open('file1.txt', 'r')
look_up = [i.split() for i in inp.readlines()]
inp.close()

inp = open('file2.txt', 'r')

holder = []

line = inp.readline()
while line:
    line = line.split()
    if [line[0], line[1], line[2]] in look_up:
        holder.append(line)
    line = inp.readline()

However, this seems to take a while. These files are rather large.

Thanks!

Comments (4)

酒与心事 2024-12-10 09:29:05


Your python version is rather inefficient because you're testing for membership in a list, rather than a set or a dict (i.e. O(n) lookup time instead of O(1)).

Try using a set of tuples or a set of strings instead. Tuples would be a better choice, as the two files could be split on different delimiters, but I don't think you'll see a particularly large performance difference either way. Building tuple('something'.split()) is relatively cheap compared to testing membership in a very long list.

Also, there's no need to call inp.readlines(). In other words, you could just do

look_up = set(tuple(line.split()) for line in inp)

And you should see a significant speedup without having to change any other part of your code, aside from using tuple(line[:3]) rather than [line[0], line[1], line[2]].
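Applied to the original snippet (with the file-handling typos fixed), the minimal change might look like this sketch; the sample data from the question is written out first so the fragment is self-contained:

```python
# Sample data from the question, written out so the sketch is self-contained.
with open('file1.txt', 'w') as f:
    f.write('label1 label2 name1\nlabel1 label3 name2\n')
with open('file2.txt', 'w') as f:
    f.write('label1 label2 name1 0.1 1000\nlabel9 label6 name7 0.8 0.5\n')

with open('file1.txt') as inp:
    # A set of tuples gives O(1) membership tests instead of O(n) list scans.
    look_up = set(tuple(line.split()) for line in inp)

holder = []
with open('file2.txt') as inp:
    for line in inp:
        fields = line.split()
        if tuple(fields[:3]) in look_up:
            holder.append(fields)
```

With the sample data above, `holder` ends up holding the single matching line, split into fields.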

Actually, grep and bash are pretty perfect for this... (Untested, but it should work.)

while read line
do
    grep "$line" "file2.txt"
done < "file1.txt"

To see which one is faster, we can generate some test data (~4500 keys in file1.txt and 1,000,000 lines in file2.txt), and benchmark a simple Python version of the same thing. (Roughly the same thing, anyway... the lines will be printed in a different order than in the grep version.)

with open('file1.txt', 'r') as keyfile:
    lookup = set(tuple(line.split()) for line in keyfile)

with open('file2.txt', 'r') as datafile:
    for line in datafile:
        if tuple(line.split()[:3]) in lookup:
            print(line, end="")

The python version turns out to be ~70x faster:

jofer@cornbread:~/so> time sh so_temp149.sh > a

real    1m47.617s
user    0m51.199s
sys     0m54.391s

vs.

jofer@cornbread:~/so> time python so_temp149.py > b

real    0m1.631s
user    0m1.558s
sys     0m0.071s
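The generation script itself isn't included in the answer; a rough sketch producing input of the shape described above (the label patterns, match rate, and random seed here are assumptions, not the original script) might be:

```python
import random

random.seed(0)  # reproducible test data

# ~4500 unique three-field keys for file1.txt (15 * 15 * 20 = 4500).
keys = [f"label{i} label{j} name{k}"
        for i in range(15) for j in range(15) for k in range(20)]

with open("file1.txt", "w") as f:
    for key in keys:
        f.write(key + "\n")

# 1,000,000 data lines for file2.txt, roughly half built from known keys.
with open("file2.txt", "w") as f:
    for _ in range(1_000_000):
        if random.random() < 0.5:
            prefix = random.choice(keys)
        else:
            prefix = f"label{random.randint(100, 999)} labelX nameX"
        f.write(f"{prefix} {random.random():.3f} {random.randint(1, 1000)}\n")
```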

Of course, the two examples are approaching the problem in entirely different ways. We're really comparing two algorithms, not two implementations. For example, if we only have a couple of key lines in file1, the bash/grep solution easily wins.

(Does bash have a builtin container of some sort with O(1) lookup for membership? (I think bash 4 might have a hash table, but I don't know anything about it...) It would be interesting to try implementing a similar algorithm to the python example above in bash, as well...)
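For what it's worth, bash 4 does have associative arrays (`declare -A`), which provide exactly that constant-time membership test. A sketch of the set-based algorithm in pure bash (untested at scale; the sample files are written inline so it runs standalone) could look like:

```shell
# Sample data from the question, written inline so the sketch runs standalone.
printf 'label1 label2 name1\nlabel1 label3 name2\n' > file1.txt
printf 'label1 label2 name1 0.1 1000\nlabel9 label6 name7 0.8 0.5\n' > file2.txt

# bash 4+: associative array used as a set of keys (O(1) lookups).
declare -A lookup
while read -r line; do
    lookup[$line]=1
done < file1.txt

# Keep lines of file2.txt whose first three fields form a known key.
while read -r a b c rest; do
    key="$a $b $c"
    if [[ -n "${lookup[$key]+x}" ]]; then
        printf '%s %s %s %s\n' "$a" "$b" "$c" "$rest"
    fi
done < file2.txt > matched.txt
```

This keeps the single-pass-over-file2 structure of the Python version, so it should scale the same way, though bash's per-line overhead will still make it much slower in practice.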

﹏半生如梦愿梦如真 2024-12-10 09:29:05


Hacky bash/sort/Perl solution:

$ cat > 1
label1 label2 name1
label1 label3 name2

$ cat > 2
label1 label2 name1 0.1 1000
label9 label6 name7 0.8 0.5

$ (cat 1; cat 2; ) | sort | perl -ne 'INIT{$pattern_re="(?:\\S+) (?:\\S+) (?:\\S+)"; $current_pattern="";} if(/^($pattern_re)$/o){$current_pattern=$1} else {if(/^($pattern_re)/o) { print if $1 eq $current_pattern} }'
label1 label2 name1 0.1 1000

It merges both files into one list and sorts it (so we get chunks of data with the same key, each led by the line from file 1), then uses a Perl one-liner to keep only the well-formed lines that have a preceding "header" line from file 1.

墟烟 2024-12-10 09:29:05


You can try using the string "label1 label2 name1" as a key, rather than the triplet of values.

雨落星ぅ辰 2024-12-10 09:29:05


I'd use a hash to store the values from the first file. It's not that error-resilient (it assumes one and only one space between items), but you'll get the general idea...

#!/usr/bin/env python

labels={}
with open('log') as fd:
    for line in fd:
        line=line.strip()
        labels[line]=True

with open('log2') as fd:
    for line in fd:
        if " ".join(line.split()[0:3]) in labels:
            print(line, end="")
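If the single-space caveat matters, normalizing whitespace on both sides makes the lookup delimiter-tolerant. Here is a small variant of the same idea wrapped as a function (the function name and file handling are my own, not part of the answer above):

```python
def match_lines(key_path, data_path):
    """Yield lines of data_path whose first three fields match a key line."""
    with open(key_path) as fd:
        # Collapse any run of whitespace so 'label1  label2 name1' still matches.
        labels = {" ".join(line.split()) for line in fd if line.strip()}
    with open(data_path) as fd:
        for line in fd:
            if " ".join(line.split()[:3]) in labels:
                yield line
```

Calling `list(match_lines('log', 'log2'))` then returns the matching lines from the second file, even if the key file uses irregular spacing.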