Extracting specific columns from a delimited file using Awk

Posted on 2024-12-11 10:30:40

Sorry if this is too basic. I have a csv file where the columns have a header row (v1, v2, etc.). I understand that to extract columns 1 and 2, I have to do: awk -F "," '{print $1 "," $2}' infile.csv > outfile.csv. But what if I have to extract, say, columns 1 to 10, 20 to 25, and 30, 33? As an addendum, is there any way to extract directly with the header names rather than with column numbers?

Comments (9)

递刀给你 2024-12-18 10:30:40

I don't know if it's possible to do ranges in awk. You could do a for loop, but you would have to add handling to filter out the columns you don't want. It's probably easier to do this:

awk -F, '{OFS=",";print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$20,$21,$22,$23,$24,$25,$30,$33}' infile.csv > outfile.csv

Something else to consider, which is faster and more concise:

cut -d "," -f1-10,20-25,30,33 infile.csv > outfile.csv

As to the second part of your question, I would probably write a script in perl that knows how to handle header rows, parsing the column names from stdin or a file and then doing the filtering. It's probably a tool I would want to have for other things. I am not sure how to do it in a one-liner, although I am sure it can be done.
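
The for-loop idea mentioned above can be sketched in awk itself by expanding a cut-style spec into a list of field numbers first. This is only a sketch: the function name `extract_cols` and the spec format are assumptions made for illustration, and it splits on every comma, so quoted fields containing commas are not handled.

```shell
# Hypothetical helper: print the columns named in a cut-style spec such as "1-3,5".
extract_cols() {
    spec=$1
    shift
    awk -F, -v OFS=, -v spec="$spec" '
    BEGIN {
        # Expand "1-3,5" into the ordered list cols[1..n] = 1,2,3,5
        m = split(spec, parts, ",")
        for (p = 1; p <= m; p++) {
            k = split(parts[p], r, "-")
            lo = r[1] + 0
            hi = (k == 2) ? r[2] + 0 : lo
            for (c = lo; c <= hi; c++) cols[++n] = c
        }
    }
    {
        # Join the selected fields with OFS
        line = ""
        for (j = 1; j <= n; j++) {
            sep = (j > 1) ? OFS : ""
            line = line sep $cols[j]
        }
        print line
    }' "$@"
}

# Example: columns 1 to 3 and 5 of a six-column line
printf 'a,b,c,d,e,f\n' | extract_cols 1-3,5
```

The example prints a,b,c,e. For the columns in the question the call would be extract_cols 1-10,20-25,30,33 infile.csv > outfile.csv.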

枕梦 2024-12-18 10:30:40

As mentioned by @Tom, the cut and awk approaches actually don't work for CSVs with quoted strings. An alternative is a module for python that provides the command line tool csvfilter. It works like cut, but properly handles CSV column quoting:

csvfilter -f 1,3,5 in.csv > out.csv

If you have python (and you should), you can install it simply like this:

pip install csvfilter

Please take note that the column indexing in csvfilter starts with 0 (unlike awk, which starts with $1). More info at https://github.com/codeinthehole/csvfilter/
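
To see the quoting problem concretely, here is a small demonstration. The sample line is made up for illustration; it has two CSV fields, the first of which contains a comma inside quotes.

```shell
# A two-field CSV line whose first field contains a comma inside quotes
line='"Smith, John",42'

# Naive splitting on every comma cuts the quoted field in half
printf '%s\n' "$line" | cut -d, -f1               # prints: "Smith
printf '%s\n' "$line" | awk -F, '{ print NF }'    # prints: 3
```

Both tools see three fields where a CSV-aware parser would see two.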

剧终人散尽 2024-12-18 10:30:40

Other languages have shortcuts for ranges of field numbers, but awk does not, so you'll have to write your code as you fear ;-)

awk -F, 'BEGIN {OFS=","} { print $1, $2, $3, $4 ..... $30, $33}' infile.csv > outfile.csv

There is no direct function in awk to use field names as column specifiers.

I hope this helps.

我不是你的备胎 2024-12-18 10:30:40

You can use a for-loop to address a field with $i:

ls -l | awk '{for(i=3 ; i<8 ; i++) {printf("%s\t", $i)} print ""}'
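
Applied to a comma-separated file, the same loop might look like the sketch below. The helper name `field_range` and its two arguments are assumptions made for this example; it prints one contiguous range of fields, joined by commas rather than tabs.

```shell
# Hypothetical helper: print fields FIRST..LAST of each comma-separated line
field_range() {
    awk -F, -v first="$1" -v last="$2" '{
        for (i = first; i <= last; i++) {
            # comma after every field except the last one in the range
            printf "%s%s", $i, (i < last ? "," : "\n")
        }
    }'
}

# Example: fields 2 to 4 of two sample lines
printf 'a,b,c,d,e\n1,2,3,4,5\n' | field_range 2 4
```

The example prints b,c,d and then 2,3,4.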

沉鱼一梦 2024-12-18 10:30:40

Others have answered your earlier question. For this:

As an addendum, is there any way to extract directly with the header names rather than with column numbers?

I haven't tried it, but you could store each header's index in a hash and then use that hash to get its index later on.

# run on the header line only (NR == 1); awk fields are numbered from 1
for (i = 1; i <= NF; i++) {
    hash[$i] = i;
}

Then later on, use it:

j = hash["header1"];
print $j;
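
Put together as a complete command, the header-lookup idea might look like the following sketch. It assumes the first line is the header and that no field contains a quoted comma; the helper name `cols_by_name` is made up for illustration. Note that awk fields are numbered from 1 and that NF (not $NF) holds the field count.

```shell
# Hypothetical helper: select CSV columns by header name, e.g. "name,height"
cols_by_name() {
    names=$1
    shift
    awk -F, -v OFS=, -v names="$names" '
    NR == 1 {
        # Map each header name to its field number
        for (i = 1; i <= NF; i++) idx[$i] = i
        n = split(names, want, ",")
    }
    {
        # Print the wanted columns for every line, header included.
        # Names missing from the header are not checked in this sketch.
        line = ""
        for (j = 1; j <= n; j++) {
            sep = (j > 1) ? OFS : ""
            line = line sep $(idx[want[j]])
        }
        print line
    }' "$@"
}

# Example with a small made-up file
printf 'name,sex,height\narthur,m,181\nberta,f,163\n' | cols_by_name name,height
```

The example prints name,height followed by arthur,181 and berta,163.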

笑叹一世浮沉 2024-12-18 10:30:40

Tabulator is a set of unix command line tools to work with csv files that have header lines. Here is an example to extract columns by name from a file test.csv:

name,sex,house_nr,height,shoe_size
arthur,m,42,181,11.5
berta,f,101,163,8.5
chris,m,1333,175,10
don,m,77,185,12.5
elisa,f,204,166,7

Then tblmap -k name,height test.csv produces

name,height
arthur,181
berta,163
chris,175
don,185
elisa,166

跨年 2024-12-18 10:30:40

If Perl is an option:

perl -F, -lane 'print join ",",@F[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32]'

-a autosplits line into @F fields array. Indices start at 0 (not 1 as in awk)
-F, field separator is ,

If your CSV file contains commas within quotes, fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();print (join ",",@f[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32])}'

I provided more explanation within my answer here: parse csv file using gawk

一念一轮回 2024-12-18 10:30:40

This doesn't use awk, but the simplest way I found to get this done was csvtool. I have other use cases for csvtool as well, and it handles quotes and delimiters correctly when they appear within the column data itself.

csvtool format '%(2)\n' input.csv
csvtool format '%(2),%(3),%(4)\n' input.csv

Replacing 2 with the column number will effectively extract the column data you are looking for.

慢慢从新开始 2024-12-18 10:30:40

You can pass the columns whose values you want to use from outside of AWK.

Eg, using GNU Awk 3.1.7

This Code

echo -e "one,two,three,four,five\none,two,three,four,five" | awk -F"," -v kfields="1_3_5" '
BEGIN {
   arrayMax=split(kfields, arrKeys, "_");
}
{
   outString="";
   for (idx = 1; idx <= arrayMax ; idx++) {
     outString=outString$arrKeys[idx];
   }
   print "outString:"outString;
   print "-----------";
}
'

This outputs only the columns you specify. For example, the value below means that only fields 1, 3 and 5 are output.

kfields="1_3_5"

Eg the output

outString:onethreefive
-----------
outString:onethreefive
-----------