Using Awk to extract specific columns from a delimited file
Sorry if this is too basic. I have a CSV file whose columns have a header row (v1, v2, etc.). I understand that to extract columns 1 and 2, I have to do:
awk -F "," '{print $1 "," $2}' infile.csv > outfile.csv
But what if I have to extract, say, columns 1 to 10, 20 to 25, and 30 and 33? As an addendum, is there any way to extract columns directly by header name rather than by column number?
Comments (9)
I don't know if it's possible to do ranges in awk. You could do a for loop, but you would have to add handling to filter out the columns you don't want. It's probably easier to do this:
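A minimal sketch of what listing every field explicitly might look like, assuming a simple CSV with no quoted fields that contain commas. The function name extract_listed is made up for illustration; setting OFS avoids repeating "," between every pair of fields.

```shell
# Print columns 1-10, 20-25, 30 and 33 of comma-separated input read from stdin.
# Assumption: no quoted fields containing commas.
extract_listed() {
    awk -F, 'BEGIN { OFS = "," }
    { print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$20,$21,$22,$23,$24,$25,$30,$33 }'
}

# Usage (file names taken from the question):
#   extract_listed < infile.csv > outfile.csv
```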
Something else to consider, which is faster and more concise:
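Presumably this refers to cut, which accepts field ranges directly. A sketch under the same assumption (no quoted commas), with extract_cut as a hypothetical wrapper name:

```shell
# Select fields 1-10, 20-25, 30 and 33 from comma-separated input on stdin.
# cut always emits the selected fields in their original file order.
extract_cut() {
    cut -d, -f1-10,20-25,30,33
}

# Usage (file names taken from the question):
#   extract_cut < infile.csv > outfile.csv
```

Note that -f30,33 picks exactly columns 30 and 33, whereas -f30-33 would also include 31 and 32.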
As to the second part of your question, I would probably write a script in Perl that knows how to handle header rows, parsing the column names from stdin or a file and then doing the filtering. It's probably a tool I would want to have for other things. I am not sure about doing it in a one-liner, although I am sure it can be done.
As mentioned by @Tom, the cut and awk approaches don't actually work for CSVs with quoted strings. An alternative is a Python module that provides the command-line tool csvfilter. It works like cut, but properly handles CSV column quoting.
If you have Python (and you should), it is simple to install; see the project page below.
Note that column indexing in csvfilter starts at 0 (unlike awk, which starts at $1). More info at https://github.com/codeinthehole/csvfilter/
Other languages have shortcuts for ranges of field numbers, but awk does not; you'll have to write out your code as you feared ;-)
There is no direct function in awk to use field names as column specifiers.
I hope this helps.
You can use a for-loop to address a field with $i:
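One way such a loop might look for the ranges in the question, as a sketch rather than the original answer's code. It assumes a simple CSV without quoted commas, and extract_ranges is a made-up name:

```shell
# Keep columns 1-10, 20-25, 30 and 33 by testing each field index i.
extract_ranges() {
    awk -F, '{
        out = ""; n = 0
        for (i = 1; i <= NF; i++)
            if ((i >= 1 && i <= 10) || (i >= 20 && i <= 25) || i == 30 || i == 33)
                out = out (n++ ? "," : "") $i   # comma before every kept field but the first
        print out
    }'
}

# Usage (file names taken from the question):
#   extract_ranges < infile.csv > outfile.csv
```

Because the loop stops at NF, rows with fewer than 33 fields simply yield fewer output columns instead of trailing empty ones.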
Others have answered the first part of your question. As for extracting by header name:
I haven't tried it, but you could store each header's index in a hash and then use that hash to get its index later on.
Then later on, use it:
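A rough sketch of the header-hash idea, not the original poster's code. It assumes the first line is the header, that header names are unique, and that no field contains a quoted comma; extract_by_name and its single argument are hypothetical.

```shell
# extract_by_name "v1,v3" reads CSV on stdin and prints only the named columns,
# in the order the names were given. The header row is printed as well.
extract_by_name() {
    awk -F, -v names="$1" '
    BEGIN { OFS = ","; n = split(names, want, ",") }
    NR == 1 {
        for (i = 1; i <= NF; i++) idx[$i] = i      # header name -> column number
        for (j = 1; j <= n; j++)
            if (!(want[j] in idx)) {
                print "unknown column: " want[j] > "/dev/stderr"
                exit 1
            }
    }
    {
        line = ""
        for (j = 1; j <= n; j++)
            line = line (j > 1 ? OFS : "") $(idx[want[j]])
        print line
    }'
}

# Usage (file names taken from the question):
#   extract_by_name v1,v3 < infile.csv > outfile.csv
```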
Tabulator is a set of Unix command-line tools for working with CSV files that have header lines. For example, to extract the columns name and height by name from a file test.csv:
tblmap -k name,height test.csv
If Perl is an option:
perl -F, -lane 'print join ",",@F[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32]'
-a autosplits the line into the @F fields array. Indices start at 0 (not 1 as in awk).
-F, sets the field separator to ,
If your CSV file contains commas within quotes, fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.
perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();print (join ",",@f[0,1,2,3,4,5,6,7,8,9,19,20,21,22,23,24,29,32])}'
I provided more explanation within my answer here: parse csv file using gawk
This doesn't use awk, but the simplest way I found to get this done was to use csvtool. I had other use cases for csvtool as well, and it handles quotes and delimiters appropriately when they appear within the column data itself.
Replacing 2 with the column number will effectively extract the column data you are looking for.
You can pass the columns whose values you want to use from outside of AWK.
For example, using GNU Awk 3.1.7, the program outputs only the columns you specify: passing the values 1, 3, and 5 means that only fields 1, 3, and 5 are output.
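A sketch of how the column list might be passed in with -v, assuming a simple CSV without quoted commas. The names pick_cols and cols are illustrative, not taken from the original answer:

```shell
# pick_cols "1,3,5" reads CSV on stdin and prints only those field numbers.
pick_cols() {
    awk -F, -v cols="$1" '
    BEGIN { n = split(cols, c, ",") }   # c[1..n] holds the requested field numbers
    {
        line = ""
        for (j = 1; j <= n; j++)
            line = line (j > 1 ? "," : "") $(c[j])
        print line
    }'
}

# Usage (file names taken from the question):
#   pick_cols 1,3,5 < infile.csv > outfile.csv
```

Iterating j from 1 to n keeps the columns in the order they were requested; a for (k in c) loop would not guarantee any particular order.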