Extracting specific columns from a delimited file using Awk

Posted on 2024-12-11 10:30:40

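Sorry if this is too basic. I have a csv file where the columns have a header row (v1, v2, etc.). I understand that to extract columns 1 and 2, I have to do: awk -F "," '{print $1 "," $2}' infile.csv > outfile.csv. But what if I have to extract, say, columns 1 to 10, 20 to 25, and 30, 33? As an addendum, is there any way to extract directly with the header names rather than with column numbers?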

递刀给你 2024-12-18 10:30:40

I don't know if it's possible to do ranges in awk. You could do a for loop, but you would have to add handling to filter out the columns you don't want. It's probably easier to do this:

awk -F, '{OFS=",";print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$20,$21,$22,$23,$24,$25,$30,$33}' infile.csv > outfile.csv

Something else to consider, which is faster and more concise:

cut -d "," -f1-10,20-25,30,33 infile.csv > outfile.csv

As to the second part of your question, I would probably write a script in Perl that knows how to handle header rows, parsing the column names from stdin or a file and then doing the filtering. It's probably a tool I would want to have for other things. I am not sure how to do it in a one-liner, although I am sure it can be done.
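For what it's worth, ranges can be emulated in awk by expanding a cut-style list once in a BEGIN block. This is only a minimal sketch, assuming plain comma-separated data with no quoted fields; the function name pick_cols and the spec format are made up for illustration:

```shell
# pick_cols SPEC: print the columns listed in a cut-style SPEC such as "1-10,20-25,30,33".
# Hypothetical helper; it splits on bare commas, so quoted fields are NOT handled.
pick_cols() {
  awk -F, -v OFS=, -v spec="$1" '
    BEGIN {
      n = split(spec, parts, ",")
      for (p = 1; p <= n; p++) {
        m = split(parts[p], r, "-")          # "20-25" -> r[1]=20, r[2]=25; "30" -> r[1]=30
        lo = r[1] + 0
        hi = (m > 1 ? r[2] : r[1]) + 0
        for (c = lo; c <= hi; c++) want[++k] = c
      }
    }
    {
      line = ""
      for (j = 1; j <= k; j++) {
        sep = (j > 1 ? OFS : "")             # separator only between fields
        line = line sep $(want[j])
      }
      print line
    }'
}

printf 'a,b,c,d,e,f\n1,2,3,4,5,6\n' | pick_cols "1-2,4,6"
```

With the question's layout the call would presumably look like pick_cols "1-10,20-25,30,33" < infile.csv > outfile.csv.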

枕梦 2024-12-18 10:30:40

As mentioned by @Tom, the cut and awk approaches actually don't work for CSVs with quoted strings. An alternative is a module for python that provides the command line tool csvfilter. It works like cut, but properly handles CSV column quoting:

csvfilter -f 1,3,5 in.csv > out.csv

If you have python (and you should), you can install it simply like this:

pip install csvfilter

Please take note that the column indexing in csvfilter starts with 0 (unlike awk, which starts with $1). More info at https://github.com/codeinthehole/csvfilter/
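To see why splitting on bare commas goes wrong, here is a small sketch; the helper name count_fields and the sample row are invented for illustration. The row has three CSV columns, but the quoted comma makes awk count one more:

```shell
# count_fields: report how many fields awk sees when splitting each line on bare commas.
count_fields() {
  awk -F, '{ print NF }'
}

# Three CSV columns, the second one quoted and containing a comma.
printf '%s\n' 'arthur,"Smith, John",42' | count_fields
```

awk reports 4 fields here because it cuts inside the quoted value, which is the case a CSV-aware tool is meant to handle.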

剧终人散尽 2024-12-18 10:30:40

Other languages have shortcuts for ranges of field numbers, but awk does not; you'll have to write the code out just as you feared ;-)

awk -F, 'BEGIN {OFS=","} { print $1, $2, $3, $4 ..... $30, $33}' infile.csv > outfile.csv

There is no direct function in awk to use field names as column specifiers.

I hope this helps.

我不是你的备胎 2024-12-18 10:30:40

You can use a for-loop to address a field with $i:

ls -l | awk '{for(i=3 ; i<8 ; i++) {printf("%s\t", $i)} print ""}'
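The same loop can be adapted to comma-separated input. This is a rough sketch, assuming simple unquoted data; the helper name field_range is hypothetical. It writes the separator between fields only, so the line does not end with a stray delimiter:

```shell
# field_range FIRST LAST: print fields FIRST..LAST of each comma-separated line.
# Hypothetical helper; no quoting support, and FIRST <= LAST is assumed.
field_range() {
  awk -F, -v first="$1" -v last="$2" '{
    for (i = first + 0; i <= last + 0; i++)
      printf "%s%s", $i, (i < last + 0 ? "," : "\n")   # comma between fields, newline after the last
  }'
}

printf 'f1,f2,f3,f4,f5,f6,f7,f8\n' | field_range 3 7
```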

沉鱼一梦 2024-12-18 10:30:40

Others have answered your earlier question. For this:

As an addendum, is there any way to extract directly with the header names rather than with column numbers?

I haven't tried it, but you could store each header's index in a hash and then use that hash to get its index later on.

NR == 1 {
    # header row: map each column name to its field number (fields start at 1)
    for (i = 1; i <= NF; i++) {
        hash[$i] = i;
    }
}

Then later on, use it:

j = hash["header1"];
print $j;
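Put together, the header lookup might look like the sketch below. It assumes a plain comma-separated file whose first line is the header; the function name by_name is made up, and the sample row is borrowed from the test.csv example elsewhere in this thread:

```shell
# by_name NAMES: print the columns whose header matches each name in the comma-separated NAMES.
# Hypothetical helper; names are not validated, and an unknown name makes awk print
# the whole line (field 0) in that position.
by_name() {
  awk -F, -v OFS=, -v names="$1" '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i    # header name -> column number
      n = split(names, want, ",")
    }
    {
      line = ""
      for (j = 1; j <= n; j++) {
        sep = (j > 1 ? OFS : "")
        line = line sep $(col[want[j]])
      }
      print line                               # the header row is printed too
    }'
}

printf 'name,sex,house_nr,height,shoe_size\narthur,m,42,181,11.5\n' | by_name "name,height"
```

Columns come out in the order the names are given, so by_name "height,name" would swap them.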

笑叹一世浮沉 2024-12-18 10:30:40

Tabulator is a set of unix command line tools to work with csv files that have header lines. Here is an example to extract columns by name from a file test.csv:

name,sex,house_nr,height,shoe_size
arthur,m,42,181,11.5
berta,f,101,163,8.5
chris,m,1333,175,10
don,m,77,185,12.5
elisa,f,204,166,7

Then tblmap -k name,height test.csv produces

name,height
arthur,181
berta,163
chris,175
don,185
elisa,166

跨年 2024-12-18 10:30:40

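If Perl is an option:

perl -F, -lane 'print join ",",@F[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32]'

-a autosplits each line into the @F fields array. Indices start at 0 (not 1 as in awk)
-F, sets the field separator to ,

If your CSV file contains commas within quotes, fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();print(join",",@f[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32])}'

I provided more explanation within my answer here: parse csv file using gawk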

一念一轮回 2024-12-18 10:30:40

Not using awk but the simplest way I was able to get this done was to just use csvtool. I had other use cases as well to use csvtool and it can handle the quotes or delimiters appropriately if they appear within the column data itself.

csvtool format '%(2)\n' input.csv
csvtool format '%(2),%(3),%(4)\n' input.csv

Replacing 2 with the column number will effectively extract the column data you are looking for.

慢慢从新开始 2024-12-18 10:30:40

You can pass the columns whose values you want to use from outside of AWK.

Eg, using GNU Awk 3.1.7

This Code

echo -e "one,two,three,four,five\none,two,three,four,five" | awk -F"," -v kfields="1_3_5" '
BEGIN {
   arrayMax=split(kfields, arrKeys, "_");
}
{
   outString="";
   for (idx = 1; idx <= arrayMax ; idx++) {
     outString=outString$arrKeys[idx];
   }
   print "outString:"outString;
   print "-----------";
}
'

This will output only the columns you specify. For example, the value below means that only fields 1, 3 and 5 are output.

kfields="1_3_5"

Eg the output

outString:onethreefive
-----------
outString:onethreefive
-----------
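The output above runs the fields together with no delimiter. If the result should stay comma-separated, a variant along these lines might do; it is a sketch only, assuming unquoted data, and the wrapper name select_fields is hypothetical:

```shell
# select_fields KFIELDS: print the fields named in an underscore-separated list such as "1_3_5",
# joined with commas. Hypothetical wrapper around the same kfields idea.
select_fields() {
  awk -F, -v OFS=, -v kfields="$1" '
    BEGIN { n = split(kfields, keys, "_") }
    {
      out = $(keys[1])
      for (idx = 2; idx <= n; idx++) out = out OFS $(keys[idx])   # OFS between fields
      print out
    }'
}

printf 'one,two,three,four,five\n' | select_fields "1_3_5"
```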
