Regex to parse a quirky CSV?
I need to parse a CSV file using AWK. A line in the CSV could look like this:
"hello, world?",1 thousand,"oneword",,,"last one"
Some important observations:
- A field inside a quoted string can contain commas and multiple words
- An unquoted field can be multiple words
- A field can be empty, indicated by two commas in a row
Any clues on writing a regular expression to split this line up properly?
Thanks!
Comments (2)
As many have observed, CSV is a harder format than it first appears. There are many edge cases and ambiguities. For example, in your sample line, is ',,,' a field containing a comma or two blank fields?
Perl, Python, Java, etc. are better equipped to deal with CSV because they have well-tested libraries for it. A regex will be more fragile.
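To illustrate the library route, here is a minimal Python sketch using the standard csv module on the sample line from the question. The helper name parse_csv_line is made up for this example, and the default dialect (comma delimiter, double-quote quoting) is assumed to match the file.

```python
import csv

def parse_csv_line(line):
    """Split one CSV line into fields using the standard csv module."""
    # csv.reader expects an iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line]))

line = '"hello, world?",1 thousand,"oneword",,,"last one"'
fields = parse_csv_line(line)
print(fields)
# ['hello, world?', '1 thousand', 'oneword', '', '', 'last one']
```

Under the default dialect, the ',,,' run is read as two empty fields, which is one way of settling the ambiguity mentioned above.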
With AWK, I have had some success with this AWK function. It works under AWK, gawk, and nawk.
Running it on your example data produces:
An example Perl solution:
Try this:
I haven't tested it with AWK though.
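For illustration only, a regex-based split could look like the Python sketch below. The pattern and the name split_csv_line are assumptions made up for this example, not the pattern this answer refers to. It uses Python's re syntax (non-capturing groups), which POSIX-style AWK regexes lack, so it would not port to AWK as written. It assumes fields never contain newlines and that quotes inside quoted fields are doubled.

```python
import re

# Hypothetical pattern: a comma, then either a quoted field or a bare field.
# Group 1 captures the inside of a quoted field, group 2 a bare field.
FIELD = re.compile(r',(?:"((?:[^"]|"")*)"|([^,"]*))')

def split_csv_line(line):
    """Split one CSV line with a regex (no embedded newlines supported)."""
    fields = []
    # Prepend a comma so every field, including the first, follows a delimiter.
    for m in FIELD.finditer("," + line):
        quoted, bare = m.group(1), m.group(2)
        if quoted is not None:
            # Collapse doubled quotes inside a quoted field.
            fields.append(quoted.replace('""', '"'))
        else:
            fields.append(bare)
    return fields

print(split_csv_line('"hello, world?",1 thousand,"oneword",,,"last one"'))
# ['hello, world?', '1 thousand', 'oneword', '', '', 'last one']
```

Malformed input, such as an unterminated quote, is not detected by this sketch, which is part of why a regex tends to be more fragile than a dedicated parser.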