How to handle commas within a CSV file being read by a bash script
I'm creating a bash script to generate some output from a CSV file (I have over 1000 entries and don't fancy doing it by hand...).
The content of the CSV file looks similar to this:
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
I have some code that can separate the fields using the comma as delimiter, but some values actually contain commas, such as Adygeya, Republic. These values are surrounded by quotes to indicate the characters within should be treated as part of the field, but I don't know how to parse it to take this into account.
Currently I have this loop:
while IFS=, read province provinceCode criteriaId countryCode country
do
echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < $input
which produces this output for the sample data given above:
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]
As you can see, the third entry is parsed incorrectly. I want it to output
[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
Comments (6)
If you want to do it all in awk (GNU awk 4 is required for this script to work as intended):
Sample output:
With Perl:
This should work with your awk version (based on this c.u.s. post, removed the embedded commas too).
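For an awk without GNU extensions, one way to approach this is to walk each line character by character and track whether the current position is inside double quotes. The following is only a minimal sketch under that assumption, not the script this answer originally showed; the function name `parse_csv_line` is hypothetical, and like the answer it drops embedded commas and the surrounding quotes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original answer): reads CSV lines on
# stdin and prints each field in brackets, using only POSIX awk features.
parse_csv_line() {
  awk '{
    n = 0; field = ""; inq = 0
    len = length($0)
    for (i = 1; i <= len; i++) {
      c = substr($0, i, 1)
      if (c == "\"") {
        inq = !inq                      # toggle quoted state, drop the quote
      } else if (c == ",") {
        # a comma outside quotes ends the field; one inside quotes is dropped
        if (!inq) { out[++n] = field; field = "" }
      } else {
        field = field c                 # ordinary character
      }
    }
    out[++n] = field                    # last field on the line
    line = ""
    for (j = 1; j <= n; j++) line = line (j > 1 ? " " : "") "[" out[j] "]"
    print line
  }'
}

# Usage, with input holding the CSV path as in the question:
#   parse_csv_line < "$input"
```

This sketch does not handle escaped double quotes (`""`) inside a field or fields that span several lines.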
After looking at @Dimitre's solution over here, you can do something like this:
Test:
To remove the double quotes ("), you can pipe the output to sed.
After thinking about the problem, I realised that since the comma in the string isn't important to me, it'd be easier to simply remove it from the input before parsing.
To that end, I've concocted a sed command that matches strings surrounded by double quotes that contain a comma. The command then removes the bits you don't want from the matched string. It does this by separating the regex into remembered sections. This solution only works where the string contains a single comma between double quotes.
The unescaped regex is
The first, third, and fifth pairs of parentheses capture the opening double quote, comma, and closing double quote respectively.
The second and fourth pairs of parentheses capture the actual content of the field which we want to keep.
sed command to remove the comma:
sed command to remove the comma and double quotes:
Updated code:
Output:
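The five-group idea described in this answer can be sketched as follows. The exact regex is an assumption reconstructed from the description, the function names are hypothetical, and the sketch assumes each line has at most one quoted field and that this field contains exactly one comma.

```shell
#!/usr/bin/env bash
# Groups: 1 = opening quote, 2 = text before the comma, 3 = the comma,
#         4 = text after the comma, 5 = closing quote.
# Assumption: at most one quoted field per line, holding a single comma.

# Remove only the embedded comma, keeping the quotes.
strip_comma_keep_quotes() {
  sed -E 's/(")([^",]*)(,)([^"]*)(")/\1\2\4\5/'
}

# Remove the embedded comma and the surrounding quotes.
strip_comma_and_quotes() {
  sed -E 's/(")([^",]*)(,)([^"]*)(")/\2\4/'
}

# Usage with the loop from the question (input holds the CSV path):
#   strip_comma_and_quotes < "$input" |
#   while IFS=, read -r province provinceCode criteriaId countryCode country
#   do
#     echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
#   done
```

Lines without a quoted field pass through unchanged. Lines with several quoted fields may be mangled, because the pattern can match across the comma that separates two quoted fields.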
If you can tolerate having the surrounding quotes persist in the output, you can use a small script I wrote called csvquote to enable awk and cut (and other UNIX text tools) to properly handle quoted fields that contain commas. You wrap the command like this:
see https://github.com/dbro/csvquote for the code and documentation
Owing to the slightly outdated version of awk on my system and a personal preference to stick to a Bash script, I've arrived at a slightly different solution. I've produced a utility script based on this blog post that parses the CSV file and replaces the delimiters with a delimiter of your choice so that the output can be captured and used to easily process the data. The script respects quoted strings and embedded commas, but will remove the double quotes it finds and doesn't work with escaped double quotes within fields.
Just posting it up in case someone else finds it useful.
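A pure Bash re-delimiting step of the kind described above might look roughly like this. It is a sketch, not the utility script from the answer: the function name `csv_redelimit` and the default `|` delimiter are assumptions, and the chosen delimiter must not occur in the data.

```shell
#!/usr/bin/env bash
# Hypothetical helper: re-delimit one CSV line.
#   $1 = CSV line, $2 = replacement delimiter (assumed default "|").
# Commas inside double quotes are kept; the double quotes are removed.
csv_redelimit() {
  local line=$1 delim=${2:-"|"} out="" in_quotes=0 ch i
  for ((i = 0; i < ${#line}; i++)); do
    ch=${line:i:1}
    if [[ $ch == '"' ]]; then
      in_quotes=$((1 - in_quotes))   # toggle state; the quote is dropped
    elif [[ $ch == ',' && $in_quotes -eq 0 ]]; then
      out+=$delim                    # field separator outside quotes
    else
      out+=$ch                       # ordinary character, or quoted comma
    fi
  done
  printf '%s\n' "$out"
}

# Usage with the loop from the question (input holds the CSV path):
#   while IFS= read -r raw; do
#     IFS='|' read -r province provinceCode criteriaId countryCode country \
#       < <(csv_redelimit "$raw")
#     echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
#   done < "$input"
```

As the answer notes for its own script, escaped double quotes (`""`) inside a field are not handled; here they simply toggle the state twice and disappear.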
Using Dimitre's solution (thank you for that) I noticed that his program ignores empty fields.
Here is the fix:
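A fix along these lines can be sketched with GNU awk's `FPAT` variable, which describes what a field looks like instead of what separates fields. Writing the unquoted alternative as `[^,]*` rather than `[^,]+` lets it match empty fields too. This is an assumption about the shape of the fix, not the code originally posted; the function name `bracket_fields` is hypothetical, and gawk 4.0 or later is required.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads CSV lines on stdin and prints each field in
# brackets. Requires GNU awk (gawk) 4.0+ for FPAT.
bracket_fields() {
  gawk -v FPAT='([^,]*)|("[^"]+")' '{
    line = ""
    for (i = 1; i <= NF; i++) {
      f = $i
      # strip the surrounding double quotes from a quoted field
      if (substr(f, 1, 1) == "\"") f = substr(f, 2, length(f) - 2)
      line = line (i > 1 ? " " : "") "[" f "]"
    }
    print line
  }'
}

# Usage, with input holding the CSV path as in the question:
#   bracket_fields < "$input"
```

Unlike the comma-stripping variants, this sketch keeps the comma inside a quoted field.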