How to subset a file: selecting a range of rows or columns

Posted 2024-11-17 11:10:26, 303 words, 3 views, 0 comments

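I would like your advice on how to subset a large file (millions of rows or millions of columns).

For example:

(1) I have a large file (millions of rows, tab-delimited). I want a subset of this file containing only rows 10000 through 100000.

(2) I have a large file (millions of columns, tab-delimited). I want a subset of this file containing only columns 10000 through 100000.

I know there are tools such as head, tail, cut, split, and awk or sed, and I can use them for simple subsetting. However, I do not know how to do this particular job with them.

Could you give me any advice? Thanks in advance.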

﹏雨一样淡蓝的深情 2024-11-24 11:10:26

Filtering rows is easy, for example with awk:

cat largefile | awk 'NR >= 10000 && NR <= 100000 { print }'

Filtering columns is even easier with cut. Tab is already cut's default delimiter, so no -d option is needed (a single-quoted '\t' would be passed as the two characters backslash and t, which GNU cut rejects):

cat largefile | cut -f 10000-100000

As Rahul Dravid mentioned, cat is not needed here, and as Zsolt Botykai added, you can improve performance by having awk exit as soon as it has passed the last wanted row:

awk 'NR > 100000 { exit } NR >= 10000' largefile
cut -f 10000-100000 largefile
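
The two filters can also be chained to take a block of rows and columns in one pipeline. The following is a minimal sketch on a small generated sample rather than a real multi-million-line file; the sample data, the scaled-down ranges (rows 3 to 5, columns 2 to 4), and the variable names are assumptions made for illustration.

```shell
#!/bin/sh
# Build a small tab-separated sample: 10 rows x 5 columns holding 1..50.
sample=$(mktemp)
seq 50 | paste - - - - - > "$sample"

# Rows 3-5 only; awk stops reading as soon as it is past row 5.
rows=$(awk 'NR > 5 { exit } NR >= 3' "$sample")

# Rows 3-5, then columns 2-4 of those rows (tab is cut's default delimiter).
both=$(awk 'NR > 5 { exit } NR >= 3' "$sample" | cut -f 2-4)

printf '%s\n' "$both"
rm -f "$sample"
```

Putting the row filter first means cut only has to split the rows that survive, which matters when the input is much longer than the selected range.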

唯憾梦倾城 2024-11-24 11:10:26

Some different solutions:

For row ranges, in sed:

sed -n 10000,100000p somefile.txt

For column ranges, in awk (with -F and OFS set to a tab so that tab-separated fields are split and rejoined correctly):

awk -F'\t' -v OFS='\t' -v f=10000 -v t=100000 '{ for (i=f; i<=t; i++) printf("%s%s", $i, (i==t) ? "\n" : OFS) }' details.txt
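
Both one-liners can be wrapped as small shell functions so the range is passed as arguments. This is only a sketch: the function names row_range and col_range are hypothetical, the sed version adds a q command so that sed quits after the last wanted line, and the awk version assumes tab-separated input.

```shell
#!/bin/sh
# row_range FIRST LAST FILE: print lines FIRST..LAST, then quit early.
row_range() {
    sed -n -e "${1},${2}p" -e "${2}q" "$3"
}

# col_range FIRST LAST FILE: print tab-separated fields FIRST..LAST.
# Splitting on a literal tab keeps empty fields and fields with spaces intact.
col_range() {
    awk -F'\t' -v OFS='\t' -v f="$1" -v t="$2" \
        '{ for (i = f; i <= t; i++) printf "%s%s", $i, (i == t ? ORS : OFS) }' "$3"
}

# Small demo inputs (made up for illustration).
numbers=$(mktemp)
seq 10 > "$numbers"
fields=$(mktemp)
printf 'a\t\tc\td\n' > "$fields"   # second field is deliberately empty

picked_rows=$(row_range 3 5 "$numbers")
picked_cols=$(col_range 2 3 "$fields")

rm -f "$numbers" "$fields"
```

With awk's default whitespace splitting, the empty second field in the demo line would be skipped and the column numbers would shift, which is why the separator is set explicitly.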

幻梦 2024-11-24 11:10:26

For the first problem, selecting a set of rows from a large file, piping tail to head is very simple. You want rows 10000 through 100000 of largefile, which is 90001 rows counting both endpoints. tail outputs largefile from row 10000 onward, and head then discards everything after the first 90001 rows.

tail -n +10000 largefile | head -n 90001
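
The count handed to head is easy to get wrong by one: an inclusive range from first to last contains last - first + 1 lines, so rows 10000 through 100000 are 90001 lines. A small sketch that computes the count instead of hard-coding it, using made-up values and variable names:

```shell
#!/bin/sh
# Inclusive range of lines to keep (small values for illustration).
first=3
last=5

# Number of lines in the inclusive range first..last.
count=$((last - first + 1))

# tail starts output at line $first; head keeps the next $count lines.
picked=$(seq 10 | tail -n +"$first" | head -n "$count")

printf '%s\n' "$picked"
```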

暮光沉寂 2024-11-24 11:10:26

I was beaten to the sed solution, so I'll post a Perl equivalent instead.
To print selected lines:

$ seq 100 | perl -ne 'print if $. >= 10 && $. <= 20'
10
11
12
13
14
15
16
17
18
19
20

To print selected columns, use an array slice (the @F array is zero-based, so @F[1..3] is columns 2 to 4):

perl -lane 'print join "\t", @F[1..3]'

-F is used in conjunction with -a to choose the delimiter on which to split lines.

To test, use seq and paste to generate some columns:

$ seq 50 | paste - - - - -
1   2   3   4   5
6   7   8   9   10
11  12  13  14  15
16  17  18  19  20
21  22  23  24  25
26  27  28  29  30
31  32  33  34  35
36  37  38  39  40
41  42  43  44  45
46  47  48  49  50

Let's print everything except the first and the last column:

$ seq 50 | paste - - - - - | perl -lane 'print join "\t", @F[1..3]'
2   3   4
7   8   9
12  13  14
17  18  19
22  23  24
27  28  29
32  33  34
37  38  39
42  43  44
47  48  49

The "\t" in the join statement is a tab character. Note that the slice @F[1..3] is required: writing $F[1] .. $F[3] would instead build a numeric range from the value of the second field to the value of the fourth, which only looks right on this sample because the values in each row happen to be consecutive.
