How is Regex.Split giving me overlapping matches?
I have a large regular expression that I use for parsing my own file format, which is similar to Lua. It works fine, except that numbers inside quotes somehow get matched twice, even though Regex.Split shouldn't return overlapping results. I've simplified it down to this console app. Any ideas?
    using System;
    using System.Text.RegularExpressions;

    class Program
    {
        static void Main(string[] args)
        {
            string pattern = "(\r\n)|(\"(.*)\")"; // Splits at \r\n and anything in "quotes"
            string input = "\"01\"\r\n" + // "01"
                           "\"02\"\r\n" + // "02"
                           "\"03\"\r\n";  // "03"
            string[] results = Regex.Split(input, pattern);
            foreach (string result in results)
            {
                // This just filters out the split \r\n and empty strings in results
                if (!string.IsNullOrWhiteSpace(result))
                    Console.WriteLine(result);
            }
            Console.ReadLine();
        }
    }
Returns:
"01"
01
"02"
02
"03"
03
Comments (2)
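From the documentation for Regex.Split: if capturing parentheses are used in the pattern, any captured text is included in the resulting string array.
You have two sets of capturing parentheses around the quoted part, one inclusive of the quotes and one exclusive. These return the strings you are seeing.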
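To make the effect of the nested groups concrete, here is a minimal sketch that splits a single quoted value with only the quoted part of the pattern. The class and method names (NestedCaptureDemo, SplitQuoted) are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

static class NestedCaptureDemo
{
    // Hypothetical helper: splits on the quoted part of the question's pattern.
    // Group 1 captures the text with quotes, group 2 captures the text without them.
    public static string[] SplitQuoted(string input) =>
        Regex.Split(input, "(\"(.*)\")");

    static void Main()
    {
        // The whole input is one match, so the pieces are: the empty text before
        // the match, both captured groups, and the empty text after the match.
        foreach (string piece in SplitQuoted("\"01\""))
            Console.WriteLine("[" + piece + "]");
    }
}
```

Both captures come from the same match, which is why the output in the question looks like overlapping results.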
Note that the pattern for Regex.Split isn't supposed to match the desired results; it's supposed to match the delimiters. A quoted string is usually not a delimiter. Also, you've used a greedy match, which would normally run from the first quote to the last one. It does not here because, by default, . does not match \n, so each match stays within a single line.
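If the quoted values are the desired results rather than delimiters, one option is to match them directly with Regex.Matches instead of splitting. The sketch below assumes quoted values never contain an escaped quote or a line break; QuotedStringExtractor and Extract are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuotedStringExtractor
{
    // Hypothetical helper: returns the contents of every "quoted" run.
    public static List<string> Extract(string input)
    {
        var values = new List<string>();
        // [^"\r\n]* cannot run past a closing quote or a line break,
        // so the match does not depend on greedy or lazy quantifiers.
        foreach (Match m in Regex.Matches(input, "\"([^\"\r\n]*)\""))
            values.Add(m.Groups[1].Value);
        return values;
    }

    static void Main()
    {
        string input = "\"01\"\r\n\"02\"\r\n\"03\"\r\n";
        Console.WriteLine(string.Join(",", Extract(input))); // 01,02,03
    }
}
```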
Overall, I'd say you're using the wrong tool. Depending on the implementation, regular expressions either cannot deal with nested groupings or do so extremely inefficiently. A simple DFA should work much better and never needs more than a single scan.
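As a rough illustration of the single-scan idea, here is a two-state scanner (outside quotes, inside quotes) for the simplified input from the question. It is only a sketch: it assumes there are no escape sequences, it silently drops an unterminated quote, and QuoteScanner and Scan are invented names. A real file format would need more states.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class QuoteScanner
{
    // Hypothetical helper: walks the input once and collects quoted values.
    public static List<string> Scan(string input)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false; // the scanner's only state

        foreach (char c in input)
        {
            if (c == '"')
            {
                if (inQuotes)
                {
                    // Closing quote: emit the collected token.
                    tokens.Add(current.ToString());
                    current.Clear();
                }
                inQuotes = !inQuotes;
            }
            else if (inQuotes)
            {
                current.Append(c);
            }
            // Characters outside quotes (here only \r and \n) are skipped.
        }
        return tokens;
    }

    static void Main()
    {
        string input = "\"01\"\r\n\"02\"\r\n\"03\"\r\n";
        Console.WriteLine(string.Join(",", Scan(input))); // 01,02,03
    }
}
```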
Just remove the outer parentheses.
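Assuming "the outer parentheses" means the pair that wraps the quotes, the change might look like the sketch below, so that only the inner group captures. FixedSplit and SplitLines are hypothetical names, and the filtering mirrors the loop from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class FixedSplit
{
    // Hypothetical helper: same split as in the question, minus the outer group.
    public static List<string> SplitLines(string input)
    {
        // The quotes are now plain delimiter text; only (.*) is captured.
        string pattern = "(\r\n)|\"(.*)\"";
        var kept = new List<string>();
        foreach (string piece in Regex.Split(input, pattern))
        {
            // Same filter as the original loop: drop \r\n and empty strings.
            if (!string.IsNullOrWhiteSpace(piece))
                kept.Add(piece);
        }
        return kept;
    }

    static void Main()
    {
        string input = "\"01\"\r\n\"02\"\r\n\"03\"\r\n";
        Console.WriteLine(string.Join(",", SplitLines(input))); // 01,02,03
    }
}
```

Removing the inner pair instead would keep the quoted form ("01") and drop the bare one.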