Splitting a string each time a deterministic finite automaton reaches a final state?

Posted on 2024-10-08 03:11:59

I have a problem that can be solved by iteration, but I'm wondering whether there is a more elegant solution using regular expressions and split().

I have a string (which Excel puts on the clipboard) that is, in essence, comma-delimited. The caveat is that when a cell value contains a comma, the whole cell is wrapped in quotation marks (presumably to escape the commas within it). An example string:

123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"

Now, I want to elegantly split this string into individual cells, but the catch is that I cannot use a normal split with a comma as the delimiter, because it would break up cells that contain a comma in their value. Another way of looking at it: I can ONLY split on a comma if an EVEN number of quotation marks precedes it.

This is easy to solve with a loop, but I'm wondering whether a regular-expression split function can capture this logic. In an attempt to solve the problem, I constructed a deterministic finite automaton (DFA) for the logic.

[Image: DFA diagram for the splitting logic]

The question now is reduced to the following: is there a way to split this string such that a new array element (corresponding to /s) is produced each time the final state (state 4 here) is reached in a DFA?
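For comparison, the loop-based approach mentioned above can be sketched in Python as a two-state machine (inside or outside quotes). This is only an illustration: the function name `split_cells` is hypothetical, the states are a simplification rather than the states of the DFA in the image, and escaped quotes inside a cell are assumed not to occur.

```python
def split_cells(line: str) -> list[str]:
    """Split a comma-delimited line, ignoring commas inside double quotes."""
    cells = []
    current = []          # characters of the cell being built
    in_quotes = False     # the single bit of state: odd or even quote count so far
    for ch in line:
        if ch == '"':
            # A quote flips the state; the quote itself is not kept.
            in_quotes = not in_quotes
        elif ch == ',' and not in_quotes:
            # A comma preceded by an even number of quotes ends the cell.
            cells.append(''.join(current))
            current = []
        else:
            current.append(ch)
    cells.append(''.join(current))  # the last cell has no trailing comma
    return cells


sample = '123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"'
print(split_cells(sample))
```

Each unquoted comma plays the role of the "final state" transition: it is the only point at which a new array element is emitted.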

Comments (2)

旧城烟雨 2024-10-15 03:11:59

Use this regex (shown unescaped): (?:(?:"[^"]*")|(?:[^,]*))

Call Regex.Matches() with it (that is the .NET method; use the equivalent on other platforms).
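As a rough analog outside .NET, here is a minimal Python sketch of the match-based approach using `re.finditer`. It differs from the pattern above in two ways, both assumptions of this sketch: the unquoted branch uses `[^,]+` instead of `[^,]*`, because in Python's `re` the starred version also yields an empty match at every comma, and capture groups are added so the surrounding quotes can be dropped. The names `CELL` and `match_cells` are hypothetical, and empty cells are assumed not to occur.

```python
import re

# Either a quoted cell (group 1 holds the text between the quotes)
# or a run of non-comma characters (group 2).
CELL = re.compile(r'"([^"]*)"|([^,]+)')


def match_cells(line: str) -> list[str]:
    """Return each cell by matching cells rather than splitting on delimiters."""
    cells = []
    for m in CELL.finditer(line):
        # Exactly one of the two groups participates in each match.
        cells.append(m.group(1) if m.group(1) is not None else m.group(2))
    return cells


sample = '123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"'
print(match_cells(sample))
```

Matching cells instead of delimiters avoids counting quotes altogether, at the cost of silently skipping empty cells in this form.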

You could further expand the above to this: ^(?:(?:"(?<Value>[^"]*)")|(?<Value>[^,]*))(?:,(?:(?:"(?<Value>[^"]*)")|(?<Value>[^,]*)))*$

This parses the whole string in one pass, but it only works if the engine supports named groups and multiple captures per group (.NET does).

小帐篷 2024-10-15 03:11:59

Eligible commas are also followed by an even number of quotes, and VBScript does support lookaheads. Try splitting on this:

",(?=(?:[^""]*""[^""]*"")*[^""]*$)"
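The same lookahead can be tried in Python with `re.split`. In the VBScript string literal above, each `""` is an escaped double quote, so the pattern itself contains single quotes; the sketch below writes it that way. The helper name `split_on_eligible_commas` and the quote-stripping step are assumptions of this sketch, not part of the answer.

```python
import re

# A comma followed by an even number of quotes up to the end of the string:
# zero or more quote pairs, then no further quotes.
ELIGIBLE_COMMA = re.compile(r',(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_on_eligible_commas(line: str) -> list[str]:
    """Split only on commas that lie outside quoted cells."""
    cells = ELIGIBLE_COMMA.split(line)
    # The split keeps the surrounding quotes; remove one pair if present.
    return [c[1:-1] if len(c) >= 2 and c[0] == '"' and c[-1] == '"' else c
            for c in cells]


sample = '123,12,"12,345",834,54,"1,111","98,273","1,923,002",23,"1,243"'
print(ELIGIBLE_COMMA.split(sample))      # quoted cells keep their quotes
print(split_on_eligible_commas(sample))  # quotes stripped
```

Note that the lookahead rescans the rest of the string at every comma, so on very long lines the loop or match-based approaches may be cheaper.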