Parsing a CSV with awk, ignoring commas inside fields

Posted 2024-10-02 18:31:16

I have a csv file where each row defines a room in a given building. Along with room, each row has a floor field. What I want to extract is all floors in all buildings.

My file looks like this...

"u_floor","u_room","name"
0,"00BDF","AIRPORT TEST            "
0,0,"BRICKER HALL, JOHN W    "
0,3,"BRICKER HALL, JOHN W    "
0,5,"BRICKER HALL, JOHN W    "
0,6,"BRICKER HALL, JOHN W    "
0,7,"BRICKER HALL, JOHN W    "
0,8,"BRICKER HALL, JOHN W    "
0,9,"BRICKER HALL, JOHN W    "
0,19,"BRICKER HALL, JOHN W    "
0,20,"BRICKER HALL, JOHN W    "
0,21,"BRICKER HALL, JOHN W    "
0,25,"BRICKER HALL, JOHN W    "
0,27,"BRICKER HALL, JOHN W    "
0,29,"BRICKER HALL, JOHN W    "
0,35,"BRICKER HALL, JOHN W    "
0,45,"BRICKER HALL, JOHN W    "
0,59,"BRICKER HALL, JOHN W    "
0,60,"BRICKER HALL, JOHN W    "
0,61,"BRICKER HALL, JOHN W    "
0,63,"BRICKER HALL, JOHN W    "
0,"0006M","BRICKER HALL, JOHN W    "
0,"0008A","BRICKER HALL, JOHN W    "
0,"0008B","BRICKER HALL, JOHN W    "
0,"0008C","BRICKER HALL, JOHN W    "
0,"0008D","BRICKER HALL, JOHN W    "
0,"0008E","BRICKER HALL, JOHN W    "
0,"0008F","BRICKER HALL, JOHN W    "
0,"0008G","BRICKER HALL, JOHN W    "
0,"0008H","BRICKER HALL, JOHN W    "

What I want is all floors in all buildings.

I am using cat, awk, sort and uniq to obtain this list, but I am having a problem with the "," in the building name field, such as "BRICKER HALL, JOHN W", which is throwing off my entire CSV generation.

cat Buildings.csv | awk -F, '{print $1","$2}' | sort | uniq > Floors.csv 

How can I get awk to use the comma but ignore a comma in between "" of a field? Alternatively, does someone have a better solution?

Based on the answer suggesting an awk CSV parser, I was able to get to this solution:

cat Buildings.csv | awk -f csv.awk | awk -F" -> 2|"  '{print $2}' | awk -F"|" '{print $2","$3}' | sort | uniq > floors.csv 

There we use the csv.awk program, and from there I split on " -> 2|", which is based on the csv.awk program's output format. The print $2 prints only the CSV-parsed contents; this works because the program prints the original line followed by " -> #", where # is the number of fields parsed from the CSV (i.e. the columns). From there I can split this csv.awk result on "|", which is what it replaces the commas with. Then sort, uniq and pipe out to a file, and done!
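
For reference, the demo output format described here looks schematically like this (an illustration based on the description, not captured output):

<original line> -> <number of fields>|<field 1>|<field 2>|...|<field N>|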

Thanks for the help.

つ可否回来 2024-10-09 18:31:16
gawk -vFPAT='[^,]*|"[^"]*"' '{print $1 "," $3}' | sort | uniq

This is an awesome GNU Awk 4 extension, where you define a field pattern instead of a field-separator pattern. Does wonders for CSV. (docs)

ETA (thanks mitchus): To remove the surrounding quotes, use gsub("^\"|\"$","",$3); if there are more fields than just $3 to process that way, just loop through them.
Note that this simple approach is not tolerant of malformed input, nor of some possible special characters between quotes – covering all of those would go beyond the scope of a neat one-liner.
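
Applied to the original task (floor plus building name), a sketch might look like the following; it folds in the quote-stripping gsub from above, skips the header row, and uses the file names from the question:

gawk -v FPAT='[^,]*|"[^"]*"' 'NR > 1 {
    gsub(/^"|"$/, "", $3)        # strip the surrounding quotes from the name field
    print $1 "," $3
}' Buildings.csv | sort -u > Floors.csv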

梦里°也失望 2024-10-09 18:31:16

The extra output you're getting from csv.awk is from demo code. It's intended that you use the functions within the script to do the parsing and then output it how you want.

At the end of csv.awk is the { ... } loop which demonstrates one of the functions. It's that code that's outputting the -> 2|.

Instead of most of that, just call the parsing function and do print csv[1], csv[2].

That part of the code would then look like:

{
    num_fields = parse_csv($0, csv, ",", "\"", "\"", "\\n", 1);
    if (num_fields < 0) {
        printf "ERROR: %s (%d) -> %s\n", csverr, num_fields, $0;
    } else {
#        printf "%s -> ", $0;
#        printf "%s", num_fields;
#        for (i = 0;i < num_fields;i++) {
#            printf "|%s", csv[i];
#        }
#        printf "|\n";
        print csv[1], csv[2]
    }
}

Save it as your_script (for example).

Do chmod +x your_script.

And cat is unnecessary. Also, you can do sort -u instead of sort | uniq.

Your command would then look like:

./your_script Buildings.csv | sort -u > floors.csv
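
One detail worth noting: for ./your_script to run directly, its first line needs an awk shebang (the interpreter path below is a common default, not something the answer specifies); otherwise invoke it as awk -f your_script Buildings.csv | sort -u.

#!/usr/bin/awk -f
# your_script: the parse_csv() function from csv.awk goes here,
# followed by the modified main block shown above.
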
雪化雨蝶 2024-10-09 18:31:16

My workaround is to strip commas from the csv using:

decommaize () {
  cat $1 | sed 's/"[^"]*"/"((&))"/g' | sed 's/\(\"((\"\)\([^",]*\)\(,\)\([^",]*\)\(\"))\"\)/"\2\4"/g' | sed 's/"(("/"/g' | sed 's/"))"/"/g' > $2
}

That is, first substitute opening quotes with "((" and closing quotes with "))", then substitute "(("whatever,whatever"))" with "whateverwhatever", then change all remaining instances of "((" and "))" back to ".
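
A possible usage sketch, assuming the function has been defined in the current shell; Buildings-stripped.csv is a hypothetical intermediate file, and the awk step follows the floor-plus-name selection used in the other answers:

decommaize Buildings.csv Buildings-stripped.csv   # $1 = input csv, $2 = output csv
awk -F, '{print $1 "," $3}' Buildings-stripped.csv | sort -u > Floors.csv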

鱼窥荷 2024-10-09 18:31:16

You could try this awk-based CSV parser:

http://lorance.freeshell.org/csv/

述情 2024-10-09 18:31:16

Since the problem is really to distinguish between a comma inside a CSV field and one that separates fields, we can replace the first kind of comma with something else so that it is easier to parse further, i.e., something like this:

0,"00BDF","AIRPORT TEST            "
0,0,"BRICKER HALL<comma> JOHN W    "

This gawk script (replace-comma.awk) does that:

BEGIN { RS = "(.)" } 
RT == "\x022" { inside++; } 
{ if (inside % 2 && RT == ",") printf("<comma>"); else printf(RT); }

This uses a gawk feature that captures the actual record separator into a variable called RT. It splits every character into a record, and as we are reading through the records, we replace the comma encountered inside a quote (\x022) with <comma>.

The FPAT solution fails in one special case where you have both escaped quotes and a comma inside the quotes, but this solution works in all cases, i.e.:

§ echo '"Adams, John ""Big Foot""",1' | gawk -vFPAT='[^,]*|"[^"]*"' '{ print $1 }'
"Adams, John "
§ echo '"Adams, John ""Big Foot""",1' | gawk -f replace-comma.awk | gawk -F, '{ print $1; }'
"Adams<comma> John ""Big Foot""",1

As a one-liner for easy copy-paste:

gawk 'BEGIN { RS = "(.)" } RT == "\x022" { inside++; } { if (inside % 2 && RT == ",") printf("<comma>"); else printf(RT); }'
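
A sketch of how this could feed the rest of the original pipeline; the sed step is an addition that restores the literal commas in the name field once the splitting is done:

gawk -f replace-comma.awk Buildings.csv \
  | awk -F, '{print $1 "," $3}' \
  | sed 's/<comma>/,/g' \
  | sort -u > Floors.csv
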
雨夜星沙 2024-10-09 18:31:16

You can use a script I wrote called csvquote to let awk ignore the commas inside the quoted fields. The command would then become:

csvquote Buildings.csv | awk -F, '{print $1","$2}' | sort | uniq | csvquote -u > Floors.csv

and cut might be a bit easier than awk for this:

csvquote Buildings.csv | cut -d, -f1,2 | sort | uniq | csvquote -u > Floors.csv

You can find the csvquote code here: https://github.com/dbro/csvquote

清晰传感 2024-10-09 18:31:16

Fully fledged CSV parsers such as Perl's Text::CSV_XS are purpose-built to handle that kind of weirdness.

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){ @f=$csv->fields(); print "$f[0],$f[1]" }' file

The input line is split into array @f
Field 1 is $f[0] since Perl starts indexing at 0

output:

u_floor,u_room
0,00BDF
0,0
0,3
0,5
0,6
0,7
0,8
0,9
0,19
0,20
0,21
0,25
0,27
0,29
0,35
0,45
0,59
0,60
0,61
0,63
0,0006M
0,0008A
0,0008B
0,0008C
0,0008D
0,0008E
0,0008F
0,0008G
0,0008H

I provided more explanation of Text::CSV_XS within my answer here: parse csv file using gawk
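
If what you want is floor plus building name rather than floor plus room, a hedged variation of the same one-liner prints the third field instead, with sort -u added as in the other answers:

perl -MText::CSV_XS -lne 'BEGIN{$csv=Text::CSV_XS->new()} if($csv->parse($_)){ @f=$csv->fields(); print "$f[0],$f[2]" }' Buildings.csv | sort -u > Floors.csv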

辞旧 2024-10-09 18:31:16

How can I get awk to use the comma but ignore a comma in between "" of a field?

$ lsb_release -a | grep ^Description
Description:    Ubuntu 20.04.2
$ awk --version                                                                          
GNU Awk 5.3.0, API 4.0, PMA Avon 8-g1, (GNU MPFR 4.2.1, GNU MP 6.3.0)
$ cat test

0,"00BDF","AIRPORT TEST            "
0,0,"BRICKER HALL, JOHN W    "
0,3,"BRICKER HALL, JOHN W    "
0,5,"BRICKER HALL, JOHN W    " $

$ awk --csv '{print $3}' test

AIRPORT TEST
BRICKER HALL, JOHN W
BRICKER HALL, JOHN W
BRICKER HALL, JOHN W
...
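
A sketch applying --csv (gawk 5.3 or newer) to the original task, assuming the Buildings.csv layout from the question; --csv strips the quotes on input, so NR > 1 skips the header row and the name comes out bare:

$ gawk --csv 'NR > 1 { print $1 "," $3 }' Buildings.csv | sort -u > Floors.csv

Note that $3 is printed without re-quoting, so names containing a comma will produce rows with an embedded, unquoted comma; add the quotes back in the print if the result must remain valid CSV.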
