How is Regex.Split giving me overlapping matches?

Posted 2024-10-08 09:03:00

I have a large regular expression that I use to parse my own file format, which is similar to Lua. It works fine, except that numbers inside quotes get matched twice, even though Split shouldn't return overlapping results. I've simplified it down to this console app. Any ideas?

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        string pattern = "(\r\n)|(\"(.*)\")"; // Splits at \r\n and anything in "quotes"

        string input = "\"01\"\r\n" + // "01"
                       "\"02\"\r\n" + // "02"
                       "\"03\"\r\n";  // "03"

        string[] results = Regex.Split(input, pattern);
        foreach (string result in results)
        {
            // Filter out the captured \r\n separators and the empty strings in results
            if (!string.IsNullOrWhiteSpace(result))
                Console.WriteLine(result);
        }
        Console.ReadLine();
    }
}

Returns:

"01"
01
"02"
02
"03"
03

Comments (2)

红ご颜醉 2024-10-15 09:03:00

From the documentation:

If capturing parentheses are used in a Regex.Split expression, any captured text is included in the resulting string array. For example, splitting the string "plum-pear" on a hyphen placed within capturing parentheses adds a string element that contains the hyphen to the returned array.

You have two sets of capturing parentheses around the quoted string, one inclusive of the quotes and one exclusive. Each contributes its own element to the result, and those are the two strings you are seeing for every value.
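To make the effect of capture groups concrete, here is a small sketch that splits the documentation's "plum-pear" example with and without capturing parentheses, and then applies the question's pattern to a single quoted value. The class and helper names are illustrative only and do not come from the question.

```csharp
using System;
using System.Text.RegularExpressions;

static class CaptureSplitDemo
{
    // Hypothetical helper: a thin wrapper around Regex.Split for this demo.
    public static string[] Split(string input, string pattern)
    {
        return Regex.Split(input, pattern);
    }

    static void Main()
    {
        // No capture group: the hyphen is only a delimiter and is dropped.
        Console.WriteLine(string.Join("|", Split("plum-pear", "-")));   // plum|pear

        // Capture group: the hyphen comes back as its own array element.
        Console.WriteLine(string.Join("|", Split("plum-pear", "(-)"))); // plum|-|pear

        // The question's pattern: group 2 returns the value with its quotes,
        // group 3 returns it without them, so both end up in the array.
        Console.WriteLine(string.Join("|", Split("\"01\"", "(\r\n)|(\"(.*)\")")));
    }
}
```

Every group that takes part in a match adds one element, which is why each quoted value shows up twice in the question's output.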

Note that the pattern for Regex.Split isn't supposed to match the desired results; it's supposed to match the delimiters. A quoted string is usually not a delimiter.
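If the quoted values are what you actually want, one option is to match them directly with Regex.Matches instead of splitting around them. The following is only a sketch under the assumption that values never contain an escaped quote; `ExtractQuoted` is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuotedValueDemo
{
    // Hypothetical helper: returns the text between each pair of double quotes.
    public static List<string> ExtractQuoted(string input)
    {
        var values = new List<string>();
        // [^"]* cannot run past the closing quote, so greediness is not an issue.
        foreach (Match m in Regex.Matches(input, "\"([^\"]*)\""))
        {
            values.Add(m.Groups[1].Value);
        }
        return values;
    }

    static void Main()
    {
        string input = "\"01\"\r\n\"02\"\r\n\"03\"\r\n";
        foreach (string value in ExtractQuoted(input))
        {
            Console.WriteLine(value); // prints 01, 02, 03 on separate lines
        }
    }
}
```

Because the pattern describes the wanted text rather than the separators, there are no empty strings or line breaks to filter out afterwards.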

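Also, note that you've used a greedy match, so you might expect "(.*)" to run from the first quote in the input to the last one. It doesn't, but only because . does not match \n unless RegexOptions.Singleline is set, so each match stays on its own line. A line containing two quoted strings would still be matched as a single value.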
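A quick way to check how the greedy `.*` behaves on this input is to count matches of the quoted-string part of the pattern with and without RegexOptions.Singleline. In .NET, `.` does not match `\n` unless that option is set, so by default a match cannot cross a line break. This is a sketch using the question's input; the class and method names are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class GreedyDotDemo
{
    // Hypothetical helper: counts matches of the greedy quoted-string pattern.
    public static int CountQuotedMatches(string input, RegexOptions options)
    {
        return Regex.Matches(input, "\"(.*)\"", options).Count;
    }

    static void Main()
    {
        string input = "\"01\"\r\n\"02\"\r\n\"03\"\r\n";

        // Default: '.' stops at '\n', so each line produces its own match.
        Console.WriteLine(CountQuotedMatches(input, RegexOptions.None));       // 3

        // Singleline: '.' also matches '\n', so one greedy match spans
        // from the first quote to the last quote in the input.
        Console.WriteLine(CountQuotedMatches(input, RegexOptions.Singleline)); // 1
    }
}
```

On a line that holds two quoted values, the greedy form would span both even without Singleline; a lazy `.*?` or a negated class such as `[^\"]*` avoids that.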

Overall, I'd say you're using the wrong tool. Regular expressions are, depending on implementation, incapable of dealing with nested groupings or extremely inefficient. A simple DFA should work much better and never need more than a single scan.
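As a rough illustration of the single-scan idea, here is a minimal two-state scanner (inside or outside quotes). It is a sketch under simplifying assumptions: no escape sequences, no nesting, and an unterminated quote is silently dropped. The names `QuoteScanner` and `Scan` are hypothetical, and a real parser for the file format would need more states.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class QuoteScanner
{
    // Walks the input once, collecting the text found between double quotes.
    public static List<string> Scan(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in input)
        {
            if (c == '"')
            {
                if (inQuotes)
                {
                    // Closing quote: emit the collected token.
                    tokens.Add(current.ToString());
                    current.Clear();
                }
                inQuotes = !inQuotes;
            }
            else if (inQuotes)
            {
                current.Append(c);
            }
            // Characters outside quotes (such as \r\n) are skipped.
        }
        return tokens;
    }

    static void Main()
    {
        foreach (string token in Scan("\"01\"\r\n\"02\"\r\n\"03\"\r\n"))
        {
            Console.WriteLine(token); // prints 01, 02, 03 on separate lines
        }
    }
}
```

Each character is examined exactly once and there is no backtracking, which is the property the answer is pointing at.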

爱的故事 2024-10-15 09:03:00

Just remove the outer parentheses:

string pattern = "(\r\n)|\"(.*)\"";

// Tested output (using the question's loop, which filters out empty and whitespace-only entries):
01
02
03