How to handle commas in a CSV file read by a bash script

Posted on 2024-12-27 19:28:24

I'm creating a bash script to generate some output from a CSV file (I have over 1000 entries and don't fancy doing it by hand...).

The content of the CSV file looks similar to this:

Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

I have some code that can separate the fields using the comma as delimiter, but some values actually contain commas, such as Adygeya, Republic. These values are surrounded by quotes to indicate the characters within should be treated as part of the field, but I don't know how to parse it to take this into account.

Currently I have this loop:

while IFS=, read province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < $input

which produces this output for the sample data given above:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]

As you can see, the third entry is parsed incorrectly. I want it to output

[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]

Comments (6)

千年*琉璃梦 2025-01-03 19:28:24

If you want to do it all in awk (GNU awk 4 is required for this script to work as intended):

awk '{ 
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" && 
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }   
 }' FPAT='([^,]+)|("[^"]+")' infile

Sample output:

% cat infile
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
% awk '{    
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" &&
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }
 }' FPAT='([^,]+)|("[^"]+")' infile
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
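If the FPAT assignment looks cryptic: unlike FS, which describes the separator, FPAT describes what a field itself looks like, so a double-quoted string containing commas counts as a single field. A commented restatement of the same idea (just a sketch, still GNU awk 4+ only):

gawk '
BEGIN {
    # A field is either a run of characters containing no comma,
    # or a double-quoted string (which may itself contain commas).
    FPAT = "([^,]+)|(\"[^\"]+\")"
}
{
    for (i = 1; i <= NF; i++) {
        f = $i
        if (f ~ /^".*"$/)                  # strip the surrounding quotes
            f = substr(f, 2, length(f) - 2)
        printf "[%s]%s", f, (i < NF ? " " : "\n")
    }
}' infile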

With Perl:

perl -MText::ParseWords -lne'
 print join " ", map "[$_]", 
   parse_line(",",0, $_);
  ' infile 

This should work with your awk version (based on this c.u.s. post; it also removes the embedded commas).

awk '{
 n = parse_csv($0, data)
 for (i = 0; ++i <= n;) {
    gsub(/,/, " ", data[i])
    printf "[%s]%s", data[i], (i < n ? OFS : RS)
    }
  }
function parse_csv(str, array,   field, i) { 
  split( "", array )
  str = str ","
  while ( match(str, /[ \t]*("[^"]*(""[^"]*)*"|[^,]*)[ \t]*,/) ) { 
    field = substr(str, 1, RLENGTH)
    gsub(/^[ \t]*"?|"?[ \t]*,$/, "", field)
    gsub(/""/, "\"", field)
    array[++i] = field
    str = substr(str, RLENGTH + 1)
  }
  return i
}' infile
霞映澄塘 2025-01-03 19:28:24

After looking at @Dimitre's solution over here, you can do something like this -

#!/usr/local/bin/gawk -f

BEGIN {
    FS="," 
    FPAT="([^,]+)|(\"[^\"]+\")"
    }

      {
    for (i=1;i<=NF;i++) 
        printf ("[%s] ",$i);
    print ""
    } 

Test:

[jaypal:~/Temp] cat filename
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

[jaypal:~/Temp] ./script.awk  filename
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia] 
[Piaui] [BR-PI] [20100] [BR] [Brazil] 
["Adygeya, Republic"] [RU-AD] [21250] [RU] [Russian Federation] 

For removing the " characters you can pipe the output to sed.

[jaypal:~/Temp] ./script.awk  filename | sed 's#\"##g'
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia] 
[Piaui] [BR-PI] [20100] [BR] [Brazil] 
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation] 
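Alternatively, the quote stripping could be folded into the awk script itself instead of the extra sed pass, along these lines (a sketch, with the same GNU awk requirement as above):

#!/usr/local/bin/gawk -f

BEGIN {
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    for (i = 1; i <= NF; i++) {
        f = $i
        gsub(/^"|"$/, "", f)    # drop the surrounding double quotes
        printf ("[%s] ", f)
    }
    print ""
}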
场罚期间 2025-01-03 19:28:24

After thinking about the problem, I realised that since the comma in the string isn't important to me, it'd be easier to simply remove it from the input before parsing.

To that end, I've concocted a sed command that matches strings surrounded by double quotes that contain a comma. The command then removes the bits you don't want from the matched string. It does this by separating the regex into remembered sections.

This solution only works where the string contains a single comma between double quotes.

The unescaped regex is

(")(.*)(,)(.*)(")

The first, third, and fifth pairs of parentheses capture the opening double quote, comma, and closing double quote respectively.

The second and fourth pairs of parentheses capture the actual content of the field which we want to keep.

sed Command To Remove Comma:

echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\1\2\3\4/' 

sed Command To Remove Comma and Double Quotes:

echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\3/' 

Updated Code:

tmpFile=$input"Temp"
sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/' < $input > $tmpFile
while IFS=, read province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < $tmpFile
rm $tmpFile

Output:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
[Bío-Bío] [CL-BI] [20154] [CL] [Chile]
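If a quoted field ever contains more than one comma, the single-comma limitation could be worked around by looping the sed substitution until nothing is left to replace (a sketch, with the same one-quoted-field-per-line assumption as the original regex):

tmpFile="${input}Temp"
# Remove one comma inside the quotes per pass, branch back while a
# substitution was made, then strip the quotes themselves. Keeping the
# label and branch in separate -e expressions should work with both
# GNU and BSD sed.
sed -e ':again' \
    -e 's/\("[^"]*\),\([^"]*"\)/\1\2/' \
    -e 't again' \
    -e 's/"//g' < "$input" > "$tmpFile"

The while read loop then stays exactly as in the updated code above.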
一念一轮回 2025-01-03 19:28:24

If you can tolerate having the surrounding quotes persist in the output, you can use a small script I wrote called csvquote to enable awk and cut (and other UNIX text tools) to properly handle quoted fields that contain commas. You wrap the command like this:

csvquote inputfile.csv | awk -F, '{print "["$1"] ["$2"] ["$3"] ["$4"] ["$5"]"}' | csvquote -u

See https://github.com/dbro/csvquote for the code and documentation.
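The same wrapping also works around the while read loop from the question, which keeps the per-field variable names (a sketch; it assumes csvquote is installed and on the PATH):

# csvquote temporarily replaces the commas inside quoted fields with
# non-printing characters, so the IFS=, split only sees "real" commas;
# csvquote -u at the end restores the original characters.
csvquote inputfile.csv |
while IFS=, read -r province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done | csvquote -u

As noted above, the surrounding quotes are still present in the restored output.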

笙痞 2025-01-03 19:28:24

Owing to the slightly outdated version of awk on my system and a personal preference to stick to a Bash script, I've arrived at a slightly different solution.

I've produced a utility script based on this blog post that parses the CSV file and replaces the delimiters with a delimiter of your choice so that the output can be captured and used to easily process the data. The script respects quoted strings and embedded commas, but will remove the double quotes it finds and doesn't work with escaped double quotes within fields.

#!/bin/bash

input=$1
delimiter=$2

if [ -z "$input" ];
then
    echo "Input file must be passed as an argument!"
    exit 98
fi

if ! [ -f $input ] || ! [ -e $input ];
then
    echo "Input file '"$input"' doesn't exist!"
    exit 99
fi

if [ -z "$delimiter" ];
then
    echo "Delimiter character must be passed as an argument!"
    exit 98
fi

gawk '{
    c=0
    $0=$0","                                   # yes, cheating
    while($0) {
        delimiter=""
        if (c++ > 0) # Evaluate and then increment c
        {
            delimiter="'$delimiter'"
        }

        match($0,/ *"[^"]*" *,|[^,]*,/)
        s=substr($0,RSTART,RLENGTH)             # save what matched in s
        gsub(/^ *"?|"? *,$/,"",s)               # remove extra stuff
        printf (delimiter s)
        $0=substr($0,RLENGTH+1)                 # "consume" what matched
    }
    printf ("\n")
}' $input
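A possible way to combine it with the loop from the question, using | as the intermediate delimiter (a sketch; csv2delim.sh is only a made-up name for the script above):

# Hypothetical file name for the script above; any character that cannot
# appear in the data will do as the second argument.
./csv2delim.sh "$input" "|" |
while IFS='|' read -r province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done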

Just posting it up in case someone else finds it useful.

蓝咒 2025-01-03 19:28:24

Using Dimitre's solution (thank you for that) I noticed that his program ignores empty fields.

Here is the fix:

awk '{ 
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" && 
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }   
 }' FPAT='([^,]*)|("[^"]+")' infile
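For example, a row with an empty criteriaId column should now keep its field count (a sketch; the exact handling of empty fields has varied a little between gawk 4.x releases):

echo 'Piaui,BR-PI,,BR,Brazil' | gawk '
BEGIN { FPAT = "([^,]*)|(\"[^\"]+\")" }
{
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" &&
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }
 }'
# expected, per the fix above: [Piaui] [BR-PI] [] [BR] [Brazil]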