如何使用 awk 解析逗号分隔值 (csv)？

发布于 2024-08-05 08:01:21 字数 2793 浏览 4 评论 0原文

我正在尝试编写一个 awk 脚本，将 CSV 格式的电子表格转换为 XML，以解决 Bugzilla 错误。输入 CSV 的格式如下（从 XLS 电子表格创建并保存为 CSV）：

tag_1,tag_2,...,tag_N
value1_1,value1_2,...,value1_N
value2_1,value2_2,...,value2_N
valueM_1,valueM_2,...,valueM_N

标题列代表 XML 标记的名称。转换为 XML 的上述文件应如下所示：

<element>
    <tag_1>value1_1</tag_1>
    <tag_2>value1_2</tag_2>
    ...
    <tag_N>value1_N</tag_N>
</element>
<element>
    <tag_1>value2_1</tag_1>
    <tag_2>value2_2</tag_2>
    ...
    <tag_N>value2_N</tag_N>
</element>
...

我必须完成此操作的 awk 脚本如下：

BEGIN {OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
            tag[i]=$i
         print "<bugzilla version=\"3.4.1\" urlbase=\"http://mozilla.com/\" maintainer=\"[email protected]\" exporter=\"[email protected]\">"}
NR != 1 {print "   <bug>"
         for (i = 1; i <= NF; i++)
            print "      <" tag[i] ">" $i "</" tag[i] ">"
         print "   </bug>"}
END {print "</bugzilla>"}

实际的 CSV 文件是：

cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling ,assigned_to,bug_status,cf_word,cf_caslte
ABCD,A-BAR-0032,A NICE DESCRIPTION - help me,pretty,Pepperoni,,,NEW,,

实际输出是：

$ awk -f csvtobugs.awk bugs.csv

<bugzilla version="3.4.1" urlbase="http://mozilla.com/" maintainer="[email protected]" exporter="[email protected]">
   <bug>
      <cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling>ABCD,A-BAR-0032,A</cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling>
      <,assigned_to,bug_status,cf_word,cf_caslte>NICE</,assigned_to,bug_status,cf_word,cf_caslte>
      <>DESCRIPTION</>
      <>-</>
      <>help</>
      <>me,pretty,Pepperoni,,,NEW,,</>
   </bug>
   <bug>
   </bug>
</bugzilla>

显然，不是预期的结果（我承认，我从这个论坛复制粘贴了这个脚本： http: //www.unix.com/shell-programming-scripting/21404-csv-xml.html）。问题是自从我查看 awk 脚本以来已经很久了，而且我不知道语法的含义。

原文

I am trying to write an awk script to convert a CSV formatted spreadsheet into XML for Bugzilla bugs. The format of the input CSV is as follows (created from an XLS spreadsheet and saved as CSV):

tag_1,tag_2,...,tag_N
value1_1,value1_2,...,value1_N
value2_1,value2_2,...,value2_N
valueM_1,valueM_2,...,valueM_N

The header column represents the name of the XML tag. The above file converted to XML should look as follows:

<element>
    <tag_1>value1_1</tag_1>
    <tag_2>value1_2</tag_2>
    ...
    <tag_N>value1_N</tag_N>
</element>
<element>
    <tag_1>value2_1</tag_1>
    <tag_2>value2_2</tag_2>
    ...
    <tag_N>value2_N</tag_N>
</element>
...

The awk script I have to accomplish this follows:

BEGIN {OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
            tag[i]=$i
         print "<bugzilla version=\"3.4.1\" urlbase=\"http://mozilla.com/\" maintainer=\"[email protected]\" exporter=\"[email protected]\">"}
NR != 1 {print "   <bug>"
         for (i = 1; i <= NF; i++)
            print "      <" tag[i] ">" $i "</" tag[i] ">"
         print "   </bug>"}
END {print "</bugzilla>"}

The actual CSV file is:

cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling ,assigned_to,bug_status,cf_word,cf_caslte
ABCD,A-BAR-0032,A NICE DESCRIPTION - help me,pretty,Pepperoni,,,NEW,,

The actual output is:

$ awk -f csvtobugs.awk bugs.csv

<bugzilla version="3.4.1" urlbase="http://mozilla.com/" maintainer="[email protected]" exporter="[email protected]">
   <bug>
      <cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling>ABCD,A-BAR-0032,A</cf_foo,cf_bar,short_desc,cf_zebra,cf_pizza,cf_dumpling>
      <,assigned_to,bug_status,cf_word,cf_caslte>NICE</,assigned_to,bug_status,cf_word,cf_caslte>
      <>DESCRIPTION</>
      <>-</>
      <>help</>
      <>me,pretty,Pepperoni,,,NEW,,</>
   </bug>
   <bug>
   </bug>
</bugzilla>

Clearly, not the intended result (I admit, I copy-pasted this script from this forum: http://www.unix.com/shell-programming-scripting/21404-csv-xml.html). The problem is that it's been SOOOOO long since I've looked at awk scripts and I have NO IDEA what the syntax means.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

痴情换悲伤 2024-08-12 08:01:21

您需要在 BEGIN 规则中设置 FS = "," 以使用逗号作为字段分隔符；如果字段分隔符是制表符，那么您显示的代码应该可以工作，这是文件中的一种不同（也很流行）的约定，即使不使用逗号，这些文件通常仍称为“CSV”；-)。

回复收藏 0 原文

掀纱窥君容 2024-08-12 08:01:21

使用您确实知道的工具:)

那个 awk 脚本看起来并不处理 " 和其他 CSV 奇怪的问题。（我认为它只是在选项卡上拆分 - 正如其他答案指出的那样，它需要更改为在 , 上拆分）python, perl .Net 等具有完全处理 CSV 和 XML 的对象，您可能可以用与 awk 脚本一样少的字符编写解决方案，更重要的是理解它。

回复收藏 0 原文

江湖正好 2024-08-12 08:01:21

请记住，在 csv 中用逗号分割是可以的，直到出现以下情况：

1997,Ford,E350,"Super, luxurious truck"

在这种情况下，它将把“超级豪华卡车”分割成两个项目，这是不正确的。我建议使用另一种语言的 csv 库，如上面帖子中的“Mark”所述。

Remember that splitting by comma in a csv is fine until you get the following scenario:

1997,Ford,E350,"Super, luxurious truck"

In which case it will split "Super, luxurious truck" into two items which is incorrect. I would recommend using the csv libs in another language as 'Mark' states in the above post.

回复收藏 0 原文

逆光下的微笑 2024-08-12 08:01:21

我可以通过更改 FS（字段分隔符）来修复它：

BEGIN {
    FS=",";
    OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
            tag[i]=$i
         print "<bugzilla version=\"3.4.1\" urlbase=\"http://mozilla.com/\" maintainer=\"[email protected]\" exporter=\"[email protected]\">"}
NR != 1 {print "   <bug>"
         for (i = 1; i <= NF; i++)
            print "      <" tag[i] ">" $i "</" tag[i] ">"
         print "   </bug>"}
END {print "</bugzilla>"}

输出：

<bugzilla version="3.4.1" urlbase="http://mozilla.com/" maintainer="[email protected]" exporter="[email protected]">
   <bug>
      <cf_foo>ABCD</cf_foo>
      <cf_bar>A-BAR-0032</cf_bar>
      <short_desc>A NICE DESCRIPTION - help me</short_desc>
      <cf_zebra>pretty</cf_zebra>
      <cf_pizza>Pepperoni</cf_pizza>
      <cf_dumpling ></cf_dumpling >
      <assigned_to></assigned_to>
      <bug_status>NEW</bug_status>
      <cf_word></cf_word>
      <cf_caslte></cf_caslte>
   </bug>
</bugzilla>

I was able to fix it by changing the FS (field separator):

BEGIN {
    FS=",";
    OFS = "\n"}
NR == 1 {for (i = 1; i <=NF; i++)
            tag[i]=$i
         print "<bugzilla version=\"3.4.1\" urlbase=\"http://mozilla.com/\" maintainer=\"[email protected]\" exporter=\"[email protected]\">"}
NR != 1 {print "   <bug>"
         for (i = 1; i <= NF; i++)
            print "      <" tag[i] ">" $i "</" tag[i] ">"
         print "   </bug>"}
END {print "</bugzilla>"}

Output:

<bugzilla version="3.4.1" urlbase="http://mozilla.com/" maintainer="[email protected]" exporter="[email protected]">
   <bug>
      <cf_foo>ABCD</cf_foo>
      <cf_bar>A-BAR-0032</cf_bar>
      <short_desc>A NICE DESCRIPTION - help me</short_desc>
      <cf_zebra>pretty</cf_zebra>
      <cf_pizza>Pepperoni</cf_pizza>
      <cf_dumpling ></cf_dumpling >
      <assigned_to></assigned_to>
      <bug_status>NEW</bug_status>
      <cf_word></cf_word>
      <cf_caslte></cf_caslte>
   </bug>
</bugzilla>

回复收藏 0 原文