Parsing CSV with embedded double quotes

Posted 2024-10-08 10:31:04

I've written a simple CSV file parser. But after looking at the wiki page on CSV formats, I noticed some "extensions" to the basic format, specifically commas embedded via double quotes. I've managed to parse those, but there is a second issue: embedded double quotes.

Example:

12345,"ABC, ""IJK"" XYZ" -> [12345] and [ABC, "IJK" XYZ]

I can't seem to find the correct way to distinguish between an embedded double quote and one that encloses a field. So my question is: what is the correct way/algorithm to parse CSV formats such as the one above?

Comments (5)

陌若浮生 2024-10-15 10:31:04

The way I normally think about this is to treat a field as either a single unquoted value or a sequence of double-quoted segments that are joined by literal quotes to form the value. That is,

  • to parse the next atom in the row:
    • read up to the first non-whitespace character
    • if the current character is not a quote:
      • mark the current spot
      • read up to the next comma or newline
      • return the text between the mark and the character before the comma (strip spaces if appropriate)
    • if the current character is a quote:
      • create an empty string buffer
      • while the current character is a quote
        • mark the current position +1 (skip the quote character)
        • read up to the next quote
        • if the buffer is not empty, append a quote to it
        • append to the buffer the text between the mark and the character before the current position (to strip both quotes)
        • advance one character (past the just read quote)
      • read up to the next comma or newline
      • return the buffer

Essentially, split the quoted string at each quote character and then concatenate the segments with a literal quote between them. Thus "ABC, ""IJK"" XYZ" is split into the segments [ABC, ], [IJK] and [ XYZ], which are joined to give ABC, "IJK" XYZ.
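The steps above can be sketched in Python roughly as follows. This is a minimal illustration rather than the answerer's own code: the names parse_atom and parse_row are hypothetical, and the sketch assumes the loop is meant to run while the current character is a quote. It collects the segments in a list and joins them at the end instead of testing whether the buffer is empty, so a field that starts with an escaped quote keeps it.

```python
def parse_atom(line, pos=0):
    """Parse one field starting at pos.

    Returns (value, index of the delimiter that ended the field or len(line)).
    """
    n = len(line)
    # Skip leading spaces and tabs before the field.
    while pos < n and line[pos] in " \t":
        pos += 1
    if pos >= n or line[pos] != '"':
        # Unquoted field: everything up to the next comma or newline.
        end = pos
        while end < n and line[end] not in ",\n":
            end += 1
        return line[pos:end].strip(), end
    # Quoted field: gather the text between successive quote characters.
    segments = []
    while pos < n and line[pos] == '"':
        mark = pos + 1                  # skip the quote character
        close = line.find('"', mark)    # read up to the next quote
        if close == -1:
            raise ValueError("unterminated quoted field")
        segments.append(line[mark:close])
        pos = close + 1                 # advance past the quote just read
    # Ignore anything between the closing quote and the delimiter.
    while pos < n and line[pos] not in ",\n":
        pos += 1
    # Joining with a quote restores each escaped "" as a single ".
    return '"'.join(segments), pos


def parse_row(line):
    """Split a single CSV record into a list of field values."""
    fields = []
    pos = 0
    while True:
        value, pos = parse_atom(line, pos)
        fields.append(value)
        if pos >= len(line) or line[pos] == "\n":
            return fields
        pos += 1  # step over the comma


print(parse_row('12345,"ABC, ""IJK"" XYZ"'))
```

For the example from the question this yields the two fields 12345 and ABC, "IJK" XYZ. The sketch handles one record at a time and stops at the first newline that follows a field.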

内心旳酸楚 2024-10-15 10:31:04

I would do this using a single-character look-ahead: if you're scanning the string and find a double quote, look at the next character to see if it is also a double quote. If it is, then the pair represents a single double-quote character in the output. If it's any other character, you're looking at the end of the quoted string (and hopefully that next character is a comma!). Be sure to account for the end-of-line condition when looking at the next character, too.
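A minimal Python sketch of this look-ahead approach might look like the following. The function name split_csv_line is hypothetical, and the sketch assumes a single record without embedded line breaks being treated specially.

```python
def split_csv_line(line):
    """Split one CSV record using a single-character look-ahead for quotes."""
    fields, buf = [], []
    in_quotes = False
    i, n = 0, len(line)
    while i < n:
        ch = line[i]
        if in_quotes:
            if ch == '"':
                # Look ahead one character, guarding against end of line.
                if i + 1 < n and line[i + 1] == '"':
                    buf.append('"')   # the pair "" stands for one literal quote
                    i += 1            # consume the second quote of the pair
                else:
                    in_quotes = False  # a lone quote closes the field
            else:
                buf.append(ch)
        elif ch == '"':
            in_quotes = True
        elif ch == ',':
            fields.append(''.join(buf))
            buf = []
        else:
            buf.append(ch)
        i += 1
    if in_quotes:
        raise ValueError("unterminated quoted field")
    fields.append(''.join(buf))
    return fields


print(split_csv_line('12345,"ABC, ""IJK"" XYZ"'))
```

The bounds check i + 1 < n is the end-of-line condition mentioned above: a quote that is the last character of the line simply closes the field.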

瀟灑尐姊 2024-10-15 10:31:04

If you find a double quote, then you should look for a matching double quote at the end of the word/string. If you can't find one, then there is an error. The same goes for a single quote.

I suggest you try Flex/Bison to write a parser for the CSV file. The two tools will generate a parser for you, and you can then compile the generated C files and call the parser from your C++ program.
In Flex, you create a scanner that finds your tokens, like "word" or ""word"". In Bison, you define the syntax.

懵少女 2024-10-15 10:31:04

A double double-quote ("") is a literal double-quote, while a lone double-quote (") is used for enclosing text (including commas).

Here's a regex for a csv field, if that makes things easier:

([^",\n][^,\n]*)|"((?:[^"]|"")+)"

Group 1 will contain the field if it isn't in quotes, group 2 will contain the field if it is in quotes, minus the surrounding quotes. In that case, just replace all instances of "" with ".
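As a rough sketch, the regex could be applied with Python's re module as shown below. The function name fields_via_regex is hypothetical; the pattern is copied unchanged from this answer.

```python
import re

# The field pattern from the answer: group 1 is an unquoted field,
# group 2 is the inside of a quoted field.
FIELD = re.compile(r'([^",\n][^,\n]*)|"((?:[^"]|"")+)"')


def fields_via_regex(line):
    """Collect every field the pattern finds in one CSV record."""
    out = []
    for m in FIELD.finditer(line):
        if m.group(2) is not None:
            # Quoted field: collapse each doubled quote into one.
            out.append(m.group(2).replace('""', '"'))
        else:
            out.append(m.group(1))
    return out


print(fields_via_regex('12345,"ABC, ""IJK"" XYZ"'))
```

One limitation to be aware of: both alternatives require at least one character, so an empty field (as in a,,b) is skipped rather than returned as an empty string.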

你怎么敢 2024-10-15 10:31:04

I suggest reading: Stop Rolling Your Own CSV Parser and this CSV RFC. The first is really just someone who wants you to use their C# CSV parser, but still explains many issues.

Your parser should examine one character at a time. I used a two-bool strategy for my parser in D. Each quote toggles whether the string is quoted or not. When you hit a quote inside a quoted cell, you set a flag and turn quoting off. If the next character is a quote, quoting is turned back on, a quote is added to the result and the flag is cleared. If the next character isn't a quote, the flag is cleared and quoting stays off.
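The two-flag strategy could be sketched in Python along these lines. The original parser was written in D and is not shown here, so the function name split_with_two_flags and the variable names are assumptions made for illustration.

```python
def split_with_two_flags(line):
    """Split one CSV record using two booleans instead of a look-ahead."""
    fields, buf = [], []
    quoted = False       # currently inside a quoted cell
    quote_seen = False   # the previous character was a quote that turned quoting off
    for ch in line:
        if quote_seen:
            quote_seen = False
            if ch == '"':
                # Two quotes in a row: an escaped quote, so resume quoting.
                quoted = True
                buf.append('"')
                continue
            # Otherwise quoting stays off and ch is handled normally below.
        if quoted:
            if ch == '"':
                quoted = False
                quote_seen = True
            else:
                buf.append(ch)
        elif ch == '"':
            quoted = True
        elif ch == ',':
            fields.append(''.join(buf))
            buf = []
        else:
            buf.append(ch)
    if quoted:
        raise ValueError("unterminated quoted field")
    fields.append(''.join(buf))
    return fields


print(split_with_two_flags('12345,"ABC, ""IJK"" XYZ"'))
```

Compared with a look-ahead, this variant never indexes past the current character, which makes it easy to drive from a stream that yields one character at a time.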
