CSV parsing with embedded double quotes
I've written a simple CSV file parser. After looking at the wiki page on CSV formats, though, I noticed some "extensions" to the basic format, specifically commas embedded via double quotes. I've managed to parse those, but there is a second issue: embedded double quotes.
Example:
12345,"ABC, ""IJK"" XYZ" -> [12345] and [ABC, "IJK" XYZ]
I can't seem to find the correct way to distinguish an embedded double quote from one that closes the field. So my question is: what is the correct way/algorithm to parse CSV formats such as the one above?
The way I normally think about this is to look at a quoted value as either a single value or a sequence of double-quoted segments that, joined by quotes, form one value. In other words, split the quoted string at each doubled quote and then concatenate the segments with a double quote between them. Thus:
"ABC, ""IJK"" XYZ"
becomes the segments [ABC, ], [IJK] and [ XYZ], which in turn become ABC, "IJK" XYZ
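A minimal Python sketch of that split-and-join idea, assuming the field has already been isolated from the rest of the line (finding the field boundaries is a separate step). The function name unquote_field is hypothetical, chosen only for illustration.

```python
def unquote_field(raw: str) -> str:
    """Decode one already-isolated CSV field (hypothetical helper)."""
    # A quoted field starts and ends with a double quote.
    if len(raw) >= 2 and raw.startswith('"') and raw.endswith('"'):
        inner = raw[1:-1]
        # Each doubled quote separates two segments of the value...
        segments = inner.split('""')
        # ...and the segments are joined back with a single literal quote.
        return '"'.join(segments)
    # Unquoted fields are taken as-is.
    return raw


print(unquote_field('"ABC, ""IJK"" XYZ"'))
```

This only handles the unescaping; it does nothing to tell a comma inside quotes from a field separator.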
I would do this using a single-character look-ahead: if you're scanning a quoted string and find a double quote, look at the next character to see whether it is also a double quote. If it is, the pair represents a single double-quote character in the output. If it's any other character, you've reached the end of the quoted string (and hopefully that next character is a comma!). Be sure to account for the end-of-line condition when looking at the next character, too.
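Something like the following is one way to sketch the look-ahead scan in Python. It assumes a single line with the line terminator already stripped and no quoted fields that span lines; split_csv_line is a made-up name, and rejecting an unterminated quote with ValueError is my own choice rather than anything CSV requires.

```python
def split_csv_line(line: str) -> list[str]:
    """Split one CSV line into fields using a one-character look-ahead."""
    fields = []
    current = []          # characters of the field being built
    in_quotes = False
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == '"':
                # Look ahead, guarding against running off the end of the line.
                if i + 1 < n and line[i + 1] == '"':
                    current.append('"')   # "" -> one literal quote
                    i += 2
                    continue
                in_quotes = False         # lone quote closes the field
            else:
                current.append(ch)        # commas are plain text in here
        elif ch == '"':
            in_quotes = True
        elif ch == ',':
            fields.append(''.join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted field")
    fields.append(''.join(current))
    return fields


print(split_csv_line('12345,"ABC, ""IJK"" XYZ"'))
```

The bounds check before reading line[i + 1] is the end-of-line condition mentioned above: a quote that is the last character of the line simply closes the field.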
If you find a double quote, you should look for a matching double quote at the end of the word/string. If you can't find one, there is an error. The same goes for a quote.
I suggest you try Flex/Bison to write a parser for the CSV file. Both tools help you generate a parser; you can then use the generated C files and call the parser from your C++ program.
In Flex, you create a scanner that finds your tokens, like "word" or ""word"". In Bison, you define the syntax.
A doubled double quote ("") is a literal double quote, while a lone double quote (") is used for enclosing text (including commas). Here's a regex for a CSV field, if that makes things easier:
Group 1 will contain the field if it isn't in quotes; group 2 will contain the field if it is in quotes, minus the surrounding quotes. In that case, just replace all instances of "" with ".
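As a stand-in for the regex described in this answer, here is a hypothetical Python pattern with the same group layout (group 1 for an unquoted field, group 2 for the inside of a quoted one). The pattern, the FIELD name and the parse_line helper are all my own assumptions, not the answerer's original expression, and the sketch expects one line with its terminator already stripped.

```python
import re

# Group 1: unquoted field (no quotes or commas).
# Group 2: contents of a quoted field, without the surrounding quotes.
# The look-aheads require each field to be followed by a comma or end of string.
FIELD = re.compile(r'([^",]*)(?=,|\Z)|"((?:[^"]|"")*)"(?=,|\Z)')


def parse_line(line: str) -> list[str]:
    """Split one CSV line by matching FIELD repeatedly (illustrative only)."""
    fields = []
    pos = 0
    while True:
        m = FIELD.match(line, pos)
        if m is None:
            raise ValueError(f"malformed CSV field at column {pos}")
        if m.group(2) is not None:
            # Quoted field: collapse each doubled quote to a single one.
            fields.append(m.group(2).replace('""', '"'))
        else:
            fields.append(m.group(1))
        pos = m.end()
        if pos == len(line):
            return fields
        pos += 1  # step over the comma the look-ahead saw


print(parse_line('12345,"ABC, ""IJK"" XYZ"'))
```

The look-ahead on the first alternative matters: without it, the unquoted branch would happily match an empty string in front of an opening quote and the quoted branch would never be tried.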
I suggest reading: Stop Rolling Your Own CSV Parser and the CSV RFC (RFC 4180). The first is really just someone who wants you to use their C# CSV parser, but it still explains many of the issues.
Your parser should examine one character at a time. I used a double-bool strategy for my parser in D. Each quote toggles whether the string is quoted or not. When in a quoted cell, you set a flag when you hit a quote and turn off quoting. If the next character is a quote, quoting is turned back on, a quote is added to the result, and the flag is cleared. If the next character isn't a quote, the flag is cleared and quoting stays off.
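Here is a rough Python rendering of that two-flag approach, not the D code the answer refers to. The names quoting, quote_seen and split_csv_line_flags are assumptions made for the sketch, and it handles a single line with no embedded newlines.

```python
def split_csv_line_flags(line: str) -> list[str]:
    """Split one CSV line using two booleans and no look-ahead."""
    fields = []
    current = []
    quoting = False      # currently inside a quoted cell
    quote_seen = False   # previous character was a quote that turned quoting off
    for ch in line:
        if quoting:
            if ch == '"':
                # Could be the closing quote or the first half of "".
                quoting = False
                quote_seen = True
            else:
                current.append(ch)
        elif ch == '"':
            if quote_seen:
                # Second quote of a pair: emit a literal quote.
                current.append('"')
            quoting = True       # resume (or start) quoting
            quote_seen = False
        else:
            quote_seen = False   # the earlier quote really did close the cell
            if ch == ',':
                fields.append(''.join(current))
                current = []
            else:
                current.append(ch)
    if quoting:
        raise ValueError("unterminated quoted field")
    fields.append(''.join(current))
    return fields


print(split_csv_line_flags('12345,"ABC, ""IJK"" XYZ"'))
```

Compared with a look-ahead scan, the decision about a quote is deferred to the next loop iteration, so the code never has to index past the current character.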