Problem parsing a comma-separated CSV file

Posted on 2024-08-25 19:11:34 · 268 words · 5 views · 0 comments

I am trying to extract the 4th column from a CSV file (comma-separated, skipping the first 2 header lines) using this command:

 awk 'NR <2 {next}{FS =","}{print $4}' filename.csv | more

However, it doesn't work because the first column can contain commas, so the 4th field awk sees is not really the 4th column. Below is an example of a row:

"sdfsdfsd, sfsdf", 454,fgdfg, I_want_this_column,sdfgdg,34546, 456465, etc

Comments (5)

巴黎夜雨 2024-09-01 19:11:34

Unless you have specific reasons for using awk, I would recommend using a CSV parsing library. Many scripting languages have one built-in (or at least available) and they'll save you from these headaches.
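
As an illustration of the difference, here is a minimal Python sketch that runs both approaches on the sample row from the question. Python's standard csv module is used as one example of such a library; `skipinitialspace=True` is an assumption made because the sample row has spaces after some commas.

```python
import csv
import io

# Sample row from the question; the first field has a comma inside the quotes.
row_text = '"sdfsdfsd, sfsdf", 454,fgdfg, I_want_this_column,sdfgdg,34546, 456465, etc\n'

# Naive splitting treats the comma inside the quotes as a delimiter.
naive_fields = row_text.rstrip("\n").split(",")

# csv.reader honours the quotes; skipinitialspace drops the blank after a comma.
parsed_fields = next(csv.reader(io.StringIO(row_text), skipinitialspace=True))

print(repr(naive_fields[3]))   # field picked by plain comma splitting
print(repr(parsed_fields[3]))  # field picked by the CSV parser
```

The naive split counts the quoted comma as a separator, so every later field shifts by one position; the parser keeps the quoted text together as a single field.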

浊酒尽余欢 2024-09-01 19:11:34

If your first column is always quoted:

 $ awk 'BEGIN{ FS="\042[ ]*," } { m=split($2,a,","); print a[3] } ' file
 I_want_this_column

If the column you want is always the second to last:

$ awk -F"," '{print $(NF-1)}' file
 I_want_this_column

You can try this demo script to break a line into columns. It splits on commas, then rejoins the pieces that belong to one quoted field:

awk 'BEGIN{ FS="," }
{
   for(i=1;i<=NF;i++){
      if(f==1){
        # inside a quoted field: append this piece
        s=s","$i
        # a trailing quote closes the field
        if($i ~ /\042[ ]*$/){
          a[++j]=s
          s="";f=0
        }
      }
      else if($i ~ /^[ ]*\042[^\042]*$/){
        # opening quote with no closing quote: start collecting
        s=$i
        f=1
      }
      else{
        # unquoted field, or quoted field without an embedded comma
        a[++j]=$i
      }
   }
}
END{
  # print columns
  for(p=1;p<=j;p++){
     print "Field "p,": "a[p]
  }
} ' file

Output:

$ cat file
"sdfsdfsd, sfsdf", "454,fgdfg blah , words ", I_want_this_column,sdfgdg

$ ./shell.sh
Field 1 : "sdfsdfsd, sfsdf"
Field 2 :  "454,fgdfg blah , words "
Field 3 :  I_want_this_column
Field 4 : sdfgdg

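
For comparison, the same split-then-rejoin idea can be sketched in Python. This is only an illustration under simple assumptions (double quotes, no escaped quotes inside fields); the function name `split_quoted` is made up for this example.

```python
def split_quoted(line, sep=","):
    """Split on sep, then glue back the pieces that belong to one quoted field."""
    fields = []
    pending = None  # pieces of a quoted field that has not been closed yet
    for piece in line.split(sep):
        if pending is not None:
            pending.append(piece)
            # A trailing quote closes the field that is being collected.
            if piece.rstrip().endswith('"'):
                fields.append(sep.join(pending))
                pending = None
        elif piece.lstrip().startswith('"') and piece.count('"') == 1:
            # Opening quote without a closing one: start collecting pieces.
            pending = [piece]
        else:
            fields.append(piece)
    if pending is not None:
        # Unterminated quote: keep what was collected rather than dropping it.
        fields.append(sep.join(pending))
    return fields

line = '"sdfsdfsd, sfsdf", "454,fgdfg blah , words ", I_want_this_column,sdfgdg'
for number, field in enumerate(split_quoted(line), start=1):
    print(f"Field {number} : {field}")
```

Like the awk script, this keeps the surrounding quotes and any leading spaces; a real CSV parser would also handle escaped quotes and quoted newlines.
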
撕心裂肺的伤痛 2024-09-01 19:11:34

You shouldn't use awk here. Use Python's csv module, Perl's Text::CSV or Text::CSV_XS modules, or another real CSV parser.

Related question: "parse csv file using gawk"
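
A minimal sketch of what the Python csv route could look like for the original task (4th column, first 2 lines skipped). The function name `extract_column` and the in-memory sample are assumptions made for illustration; `skipinitialspace=True` is assumed because the sample row has spaces after some commas.

```python
import csv
import io
from itertools import islice

def extract_column(source, column=4, header_lines=2):
    """Yield the given 1-based column from each data row of a CSV stream."""
    reader = csv.reader(source, skipinitialspace=True)
    # islice skips the header rows without reading the whole file into memory.
    for row in islice(reader, header_lines, None):
        if len(row) >= column:  # ignore rows that are too short
            yield row[column - 1]

# Hypothetical stand-in for filename.csv: two header lines plus the sample row.
sample = io.StringIO(
    "header line 1\n"
    "header line 2\n"
    '"sdfsdfsd, sfsdf", 454,fgdfg, I_want_this_column,sdfgdg,34546, 456465, etc\n'
)
for value in extract_column(sample):
    print(value)
```

For a real file, the stream would come from `open("filename.csv", newline="")` instead of `io.StringIO`.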

独享拥抱 2024-09-01 19:11:34

If you can't avoid awk, this piece of code does the job you need:

BEGIN {FS=",";}

{
        f=0;
        j=0;
        for (i = 1; i <=NF ; ++i) {
                if (f) {
                        a[j] = a[j] "," $(i);
                        if ($(i) ~ "\"$") {
                                f = 0;
                        }
                }
                else {
                        ++j;
                        a[j] = $(i);
                        if ((a[j] ~ "^\"[^\"]*$")) {
                                f = 1;
                        }
                }
        }
        for (i = 1; i <= j; ++i) {
                gsub("^\"","",a[i]);
                gsub("\"$","",a[i]);
                gsub("\"\"","\"",a[i]);
                print i " = \"" a[i] "\"";
        }
}
讽刺将军 2024-09-01 19:11:34

Working with CSV files that have quoted fields with commas inside can be difficult with the standard UNIX text tools.

I wrote a program called csvquote to make the data easy for them to handle. In your case, you could use it like this:

csvquote filename.csv | awk -F, 'NR > 2 {print $4}' | csvquote -u | more

or you could use cut and tail like this:

csvquote filename.csv | tail -n +3 | cut -d, -f4 | csvquote -u | more

The code and docs are here: https://github.com/dbro/csvquote
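
The pipelines above rely on the quoted commas being hidden from awk and cut, then put back by `csvquote -u`. Below is a rough Python sketch of that sanitize/restore idea. It is not csvquote's actual code, and the placeholder character is an assumption chosen for this example; see the linked repository for what the tool really does.

```python
PLACEHOLDER = "\x1f"  # assumed placeholder: a control character unlikely to occur in data

def sanitize(text):
    """Replace commas that sit inside double quotes with PLACEHOLDER."""
    out = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # Each quote toggles the state; a doubled "" toggles twice and cancels out.
            in_quotes = not in_quotes
        out.append(PLACEHOLDER if (ch == "," and in_quotes) else ch)
    return "".join(out)

def restore(text):
    """Turn every PLACEHOLDER back into a comma."""
    return text.replace(PLACEHOLDER, ",")

line = '"sdfsdfsd, sfsdf",454,fgdfg,I_want_this_column,sdfgdg'
safe = sanitize(line)

# After sanitizing, a plain comma split lines up with the real columns.
fields = [restore(field) for field in safe.split(",")]
print(fields[3])
```

Once the quoted commas are masked, any tool that splits on plain commas sees the right number of fields, and restoring afterwards brings the original text back.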
