如何使用 awk 解析逗号分隔值 (csv)?
我正在尝试编写一个 awk 脚本,将 CSV 格式的电子表格转换为 XML,以解决 Bugzilla 错误。输入 CSV 的格式如下(从 XLS 电子表格创建并保存为 CSV):
tag_1,tag_2,...,tag_N
value1_1,value1_2,...,value1_N
value2_1,value2_2,...,value2_N
valueM_1,valueM_2,...,valueM_N
标题列代表 XML 标记的名称。转换为 XML 的上述文件应如下所示:
<element>
<tag_1>value1_1</tag_1>
<tag_2>value1_2</tag_2>
...
<tag_N>value1_N</tag_N>
</element>
<element>
<tag_1>value2_1</tag_1>
<tag_2>value2_2</tag_2>
...
<tag_N>value2_N</tag_N>
</element>
...
我必须完成此操作的 awk 脚本如下:
BEGIN {OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
tag[i]=$i
print "<bugzilla version=\"3.4.1\" urlbase=\"http://mozilla.com/\" maintainer=\"[email protected]\" exporter=\"[email protected]\">"}
NR != 1 {print " <bug>"
for (i = 1; i <= NF; i++)
print " <" tag[i] ">" $i "</" tag[i] ">"
print " </bug>"}
END {print "</bugzilla>"}
实际的 CSV 文件是:
cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling ,assigned_to,bug_status,cf_word,cf_caslte
ABCD,A-BAR-0032,A NICE DESCRIPTION - help me,pretty,Pepperoni,,,NEW,,
实际输出是:
$ awk -f csvtobugs.awk bugs.csv
<bugzilla version="3.4.1" urlbase="http://mozilla.com/" maintainer="[email protected]" exporter="[email protected]">
<bug>
<cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling>ABCD,A-BAR-0032,A</cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling>
<,assigned_to,bug_status,cf_word,cf_caslte>NICE</,assigned_to,bug_status,cf_word,cf_caslte>
<>DESCRIPTION</>
<>-</>
<>help</>
<>me,pretty,Pepperoni,,,NEW,,</>
</bug>
<bug>
</bug>
</bugzilla>
显然,不是预期的结果(我承认,我从这个论坛复制粘贴了这个脚本: http: //www.unix.com/shell-programming-scripting/21404-csv-xml.html)。问题是自从我查看 awk 脚本以来已经很久了,而且我不知道语法的含义。
I am trying to write an awk script to convert a CSV formatted spreadsheet into XML for Bugzilla bugs. The format of the input CSV is as follows (created from an XLS spreadsheet and saved as CSV):
tag_1,tag_2,...,tag_N
value1_1,value1_2,...,value1_N
value2_1,value2_2,...,value2_N
valueM_1,valueM_2,...,valueM_N
The header column represents the name of the XML tag. The above file converted to XML should look as follows:
<element>
<tag_1>value1_1</tag_1>
<tag_2>value1_2</tag_2>
...
<tag_N>value1_N</tag_N>
</element>
<element>
<tag_1>value2_1</tag_1>
<tag_2>value2_2</tag_2>
...
<tag_N>value2_N</tag_N>
</element>
...
The awk script I have to accomplish this follows:
BEGIN {OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
tag[i]=$i
print "<bugzilla version=\"3.4.1\" urlbase=\"http://mozilla.com/\" maintainer=\"[email protected]\" exporter=\"[email protected]\">"}
NR != 1 {print " <bug>"
for (i = 1; i <= NF; i++)
print " <" tag[i] ">" $i "</" tag[i] ">"
print " </bug>"}
END {print "</bugzilla>"}
The actual CSV file is:
cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling ,assigned_to,bug_status,cf_word,cf_caslte
ABCD,A-BAR-0032,A NICE DESCRIPTION - help me,pretty,Pepperoni,,,NEW,,
The actual output is:
$ awk -f csvtobugs.awk bugs.csv
<bugzilla version="3.4.1" urlbase="http://mozilla.com/" maintainer="[email protected]" exporter="[email protected]">
<bug>
<cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling>ABCD,A-BAR-0032,A</cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling>
<,assigned_to,bug_status,cf_word,cf_caslte>NICE</,assigned_to,bug_status,cf_word,cf_caslte>
<>DESCRIPTION</>
<>-</>
<>help</>
<>me,pretty,Pepperoni,,,NEW,,</>
</bug>
<bug>
</bug>
</bugzilla>
Clearly, not the intended result (I admit, I copy-pasted this script from this forum: http://www.unix.com/shell-programming-scripting/21404-csv-xml.html). The problem is that it's been SOOOOO long since I've looked at awk scripts and I have NO IDEA what the syntax means.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(6)
您需要在
BEGIN
规则中设置FS = ","
以使用逗号作为字段分隔符;如果字段分隔符是制表符,那么您显示的代码应该可以工作,这是文件中的一种不同(也很流行)的约定,即使不使用逗号,这些文件通常仍称为“CSV”;-)。You need to set
FS = ","
in theBEGIN
rule to use comma as your field separator; the code as you show it should work if the field separator was a tab, which is a different (also popular) convention in files that are often still called "CSV" even then commas aren't used;-).使用您确实知道的工具:)
那个 awk 脚本看起来并不处理 " 和其他 CSV 奇怪的问题。(我认为它只是在选项卡上拆分 - 正如其他答案指出的那样,它需要更改为在 , 上拆分)python, perl .Net 等具有完全处理 CSV 和 XML 的对象,您可能可以用与 awk 脚本一样少的字符编写解决方案,更重要的是理解它。
Use a tool that you do know:)
That awk script does not look it deals with " and other CSV oddities. (I think it just splits on tabs - as the other answers note it needs to be change to split on , ) python, perl .Net etc have objects to fully deal with CSV and XML and probably you could write the solution in as few characters as the awk script and MORE importantly understand it.
请记住,在 csv 中用逗号分割是可以的,直到出现以下情况:
在这种情况下,它将把“超级豪华卡车”分割成两个项目,这是不正确的。我建议使用另一种语言的 csv 库,如上面帖子中的“Mark”所述。
Remember that splitting by comma in a csv is fine until you get the following scenario:
In which case it will split "Super, luxurious truck" into two items which is incorrect. I would recommend using the csv libs in another language as 'Mark' states in the above post.
我可以通过更改 FS(字段分隔符)来修复它:
输出:
I was able to fix it by changing the FS (field separator):
Output:
您可以使用各种技巧,例如设置 FS。更多技巧可以在 Awk 新闻组中找到。还有像我这样的解析器: http://lorance.freeshell.org/csv/
You can use various tricks like setting FS. More tricks can be found at the Awk news group. There are also parsers like mine: http://lorance.freeshell.org/csv/
您可以尝试使用我的 csvprintf。它可以将 CSV 转换为 XML,然后您可以根据需要使用 XSLT 对其进行样式设置。
You might try my csvprintf instead. It can convert CSV to XML, which you can then style with XSLT as desired.