Extracting a specific set of lines from a file

Published 2024-09-14 19:42:54 · 463 characters · 4 views · 0 comments

I have many large (~30 MB apiece) tab-delimited text files with variable-width lines. I want to extract the 2nd field from the nth line (here, n=4) and from the next-to-last line (the last line is empty). I can get them separately using awk:

awk 'NR==4{print $2}' filename.dat

and (I don't comprehend this entirely but)

awk '{y=x "\n" $2};END{print y}' filename.dat

but is there a way to get them together in one call? My broader intention is to wrap it in a Python script to harvest these values from a large number of files (many thousands) in separate directories and I want to reduce the number of system calls. Thanks a bunch -

Edit: I know I can read over the whole file with Python to extract those values, but thought awk might be more appropriate for the task (having to do with one of the two values located near the end of the large file).
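For reference, the two invocations behave like this on a small made-up sample file (the name and contents are illustrative only). In the second command, x is never assigned, so y is always a newline followed by the current line's second field; at END that is the final record's value, which is the next-to-last visible line when the file ends with a newline:

```shell
# Made-up tab-separated sample; real files are ~30 MB.
printf 'a\t1\nb\t2\nc\t3\nd\t4\ne\t5\n' > q.dat

awk 'NR==4{print $2}' q.dat               # field 2 of line 4
awk '{y=x "\n" $2};END{print y}' q.dat    # a blank line, then field 2 of the final record
```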

Comments (4)

生死何惧 2024-09-21 19:42:54
awk 'NR==4{print $2};{y=x "\n" $2};END{print y}' filename.dat
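This combined one-liner can be sanity-checked on a small made-up sample (file name and contents are illustrative). The blank line in its output comes from the unassigned x:

```shell
# Made-up tab-separated sample file.
printf 'a\t1\nb\t2\nc\t3\nd\t4\ne\t5\n' > sample.dat

# One pass: NR==4 prints its field immediately; END prints the value
# remembered from the final record, preceded by a blank line because
# x is never assigned.
awk 'NR==4{print $2};{y=x "\n" $2};END{print y}' sample.dat
```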
兲鉂ぱ嘚淚 2024-09-21 19:42:54

You can pass the number of lines into awk:

awk -v lines=$( wc -l < filename.dat ) -v n=4 '
    NR == n || NR == lines-1 {print $2}
' filename.dat

Note: in the wc command, the < redirection is used so that wc prints only the count, not the filename.
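On a small made-up sample, the two-step version looks like this (wc makes one pass to count lines; awk then makes a second pass over the same file):

```shell
# Six-line made-up sample; the real files are ~30 MB and tab-separated.
printf 'a\t1\nb\t2\nc\t3\nd\t4\ne\t5\nf\t6\n' > counts.dat

# Quoting the count guards against the leading spaces some wc
# implementations (e.g. BSD) emit; awk still converts it to a number.
lines=$(wc -l < counts.dat)
awk -v lines="$lines" -v n=4 'NR == n || NR == lines-1 {print $2}' counts.dat
```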

一萌ing 2024-09-21 19:42:54

Here's how to implement this in Python without reading the whole file.

To get the nth line, you have no choice but to read the file up to the nth line, as the lines are variable width.

To get the second-to-last line, guess how long the line might be (be generous) and seek to that many bytes before the end of the file.

read() from the point you seeked to and count the newline characters - you need at least two. If there are fewer than two newlines, double your guess and try again.

Split the data you read at newlines - the line you want will be the second-to-last item in the split.
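The same end-of-file trick can be sketched in the shell with tail -c, which reads only the final bytes of a regular file rather than streaming all of it; the sample file and the 4096-byte guess are illustrative:

```shell
# Made-up tab-separated sample file.
printf 'a\t1\nb\t2\nc\t3\nd\t4\ne\t5\nf\t6\n' > big.dat

# Grab only the last 4096 bytes, keep the next-to-last line of that
# chunk, and cut out field 2 (tab is cut's default delimiter).
tail -c 4096 big.dat | tail -n 2 | head -n 1 | cut -f2
```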

飘落散花 2024-09-21 19:42:54

This is my solution in Python. Inspired by this other code:

def readfields(filename, nfromtop=3, nfrombottom=-2, fieldnum=1, blocksize=4096):
    out = ''
    # Binary mode: Python 3 only allows end-relative seeks on binary files.
    with open(filename, 'rb') as f:
        for i, line in enumerate(f):
            if i == nfromtop:
                out += line.decode().split('\t')[fieldnum] + '\t'
                break
        f.seek(-blocksize, 2)  # 2 = os.SEEK_END: blocksize bytes before EOF
        out += f.read(blocksize).decode().split('\n')[nfrombottom].split('\t')[fieldnum]
    return out

When I profiled it, it was 0.09 seconds faster than a solution calling awk (awk 'NR==4{print $2};{y=x $2};END{print y}' filename.dat) through the subprocess module. Not a dealbreaker, but when the rest of the script is in Python there seems to be a payoff in staying there (especially since I have a lot of these files).
