Parsing a CSV file using gawk
How do you parse a CSV file using gawk? Simply setting FS="," is not enough, as a quoted field with a comma inside will be treated as multiple fields.
Example using FS="," which does not work:
file contents:
one,two,"three, four",five
"six, seven",eight,"nine"
gawk script:
BEGIN { FS="," }
{
    for (i = 1; i <= NF; i++) printf "field #%d: %s\n", i, $i
    printf "---------------------------\n"
}
bad output:
field #1: one
field #2: two
field #3: "three
field #4: four"
field #5: five
---------------------------
field #1: "six
field #2: seven"
field #3: eight
field #4: "nine"
---------------------------
desired output:
field #1: one
field #2: two
field #3: "three, four"
field #4: five
---------------------------
field #1: "six, seven"
field #2: eight
field #3: "nine"
---------------------------
The gawk version 4 manual says to use
FPAT = "([^,]*)|(\"[^\"]+\")"
When
FPAT
is defined, it disables FS and specifies fields by content instead of by separator.
The short answer is "I wouldn't use gawk to parse CSV if the CSV contains awkward data", where 'awkward' means things like commas in the CSV field data.
The next question is "What other processing are you going to be doing", since that will influence what alternatives you use.
I'd probably use Perl and the Text::CSV or Text::CSV_XS modules to read and process the data. Remember, Perl was originally written in part as an awk and sed killer - hence the a2p and s2p programs still distributed with Perl, which convert awk and sed scripts (respectively) into Perl.
You can use a simple wrapper function called csvquote to sanitize the input and restore it after awk is done processing it. Pipe your data through it at the start and end, and everything should work out ok:
before:
after:
See https://github.com/dbro/csvquote for code and documentation.
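A sketch of that sandwich, assuming csvquote is installed and on your PATH (the filename is illustrative):

```shell
# csvquote replaces commas and newlines inside quoted fields with
# nonprinting characters, so a plain FS="," split is safe in between;
# csvquote -u restores the original characters afterwards.
csvquote file.csv |
gawk 'BEGIN { FS = "," } { printf "field #3: %s\n", $3 }' |
csvquote -u
```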
If permissible, I would use the Python csv module, paying special attention to the dialect used and formatting parameters required, to parse the CSV file you have.
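For instance, a minimal sketch on the question's sample data (note that csv.reader strips the surrounding quotes by default, unlike the raw-field output shown in the question):

```shell
python3 - <<'EOF'
# Python's csv module tokenizes quoted fields correctly out of the box.
import csv, io

data = 'one,two,"three, four",five\n"six, seven",eight,"nine"\n'
for row in csv.reader(io.StringIO(data)):
    for i, field in enumerate(row, 1):
        print("field #%d: %s" % (i, field))
    print("---------------------------")
EOF
```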
csv2delim.awk
test.csv
test.bat
I am not exactly sure whether this is the right way to do things. I would rather work on a csv file in which either all values are quoted or none. Btw, awk allows regexes to be Field Separators. Check if that is useful.
Patterns that follow this one can access the fields in Column[]. ColumnCount indicates the number of elements in Column[] that were found. If not all rows contain the same number of columns, Column[] contains extra data after Column[ColumnCount] when processing the shorter rows.
This implementation is slow, but it appears to emulate the FPAT / patsplit() feature found in gawk >= 4.0.0 mentioned in a previous answer.
Reference
Here's what I came up with. Any comments and/or better solutions would be appreciated.
The basic idea is that I loop through the fields, and any field which starts with a quote but does not end with a quote gets the next field appended to it.
Perl has the Text::CSV_XS module which is purpose-built to handle the quoted-comma weirdness.
Alternately try the Text::CSV module.
perl -MText::CSV_XS -ne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){@f=$csv->fields();for $n (0..$#f) {print "field #$n: $f[$n]\n"};print "---\n"}' file.csv
Produces this output:
Here's a human-readable version.
Save it as parsecsv, chmod +x, and run it as "parsecsv file.csv"
You may need to point to a different version of perl on your machine, since the Text::CSV_XS module may not be installed on your default version of perl.
If none of your versions of Perl have Text::CSV_XS installed, you'll need to:
sudo apt-get install cpanminus
sudo cpanm Text::CSV_XS
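The "human-readable version" itself is not shown above; a hypothetical reconstruction, using the same Text::CSV_XS calls as the one-liner, might look like this:

```shell
# Hypothetical standalone version of the one-liner; writes it to a file
# named parsecsv and marks it executable, as the answer describes.
cat > parsecsv <<'EOF'
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new();
while (my $line = <>) {
    if ($csv->parse($line)) {
        my @fields = $csv->fields();
        print "field #$_: $fields[$_]\n" for 0 .. $#fields;
        print "---\n";
    }
}
EOF
chmod +x parsecsv
```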