How can I fix my regex so it doesn't match too much with a greedy quantifier?
I have the following line:
"14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)"
I parse this by using a simple regexp:
if($line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/) {
    my($ts, $hash, $pid, $handle, $quote) = ($1, $2, $3, $4, $5);
}
But the ; at the end messes things up and I don't know why. Shouldn't the greedy operator handle "everything"?
Comments (6)
The greedy operator tries to grab as much stuff as it can and still match the string. What's happening is the first one (after "say") grabs "0ed673079715c343281355c2a1fde843;2", the second one takes "laka", the third finds "hello " and the fourth matches the parenthesis.
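To see those captures for yourself, here is a quick sketch that runs the question's pattern unchanged (the sample line is copied from the question; @fields is just a name I picked for illustration):

```perl
use strict;
use warnings;

my $line = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# The original, all-greedy pattern from the question, evaluated in list
# context so that the five captures land in @fields.
my @fields = $line =~ /(\d+:\d+)\ssay;(.*);(.*);(.*);(.*)/;

# Print each capture so the misplaced field boundaries are visible.
# The second capture comes out as '0ed673079715c343281355c2a1fde843;2'
# and the last one as ')'.
printf "\$%d = '%s'\n", $_ + 1, $fields[$_] for 0 .. $#fields;
```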
What you need to do is make all but the last one non-greedy, so they grab as little as possible and still match the string:
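Applied to the pattern from the question, that would look something like the sketch below (declaring the variables before the if is my own choice, so they stay visible afterwards):

```perl
use strict;
use warnings;

my $line = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# Each lazy (.*?) stops at the first ';' that still lets the rest of the
# pattern match. The final greedy (.*) then takes whatever is left,
# semicolons included.
my ($ts, $hash, $pid, $handle, $quote);
if ($line =~ /(\d+:\d+)\ssay;(.*?);(.*?);(.*?);(.*)/) {
    ($ts, $hash, $pid, $handle, $quote) = ($1, $2, $3, $4, $5);
}
```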
should work better
Although a regex can easily do this, I'm not sure it's the most straightforward approach. It's probably the shortest, but that doesn't make it the most maintainable.
Instead, I'd suggest something like this:
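A minimal sketch of that idea, assuming the timestamp is followed by a single space and that only the final field may contain semicolons ($rest and $cmd are names I made up for illustration):

```perl
use strict;
use warnings;

my $line = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# Step 1: separate the timestamp from everything after the first space.
my ($ts, $rest) = split / /, $line, 2;

# Step 2: cut the remainder on ';' into at most five pieces, so any
# semicolons inside the quote stay in the last piece.
my ($cmd, $hash, $pid, $handle, $quote) = split /;/, $rest, 5;

# Show every field on its own line.
print "ts:     $ts\n";
print "cmd:    $cmd\n";
print "hash:   $hash\n";
print "pid:    $pid\n";
print "handle: $handle\n";
print "quote:  $quote\n";
```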
This results in:
I think this is just a bit more readable. Not only that, I think it's also easier to debug and maintain, because it is closer to how a human would attempt the same thing with pen and paper. Break the string down into chunks that you can then parse more easily, and have the computer do exactly what you would do. When it comes time to make modifications, I think this one will fare better. YMMV.
Try making the first 3 (.*) ungreedy: (.*?)
If the values in your semicolon-delimited list cannot include any semicolons themselves, you'll get the most efficient and straightforward regular expression simply by spelling that out. If certain values can only be, say, a string of hex characters, spell that out. Solutions using a lazy or greedy dot will always lead to a lot of useless backtracking when the regex does not match the subject string.
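As a sketch of what spelling it out could look like: the field formats below are my guesses from the one sample line (lowercase hex for the hash, decimal digits for the pid, no semicolon in the handle), and the ^ and $ anchors are my addition, so adjust them to your real data.

```perl
use strict;
use warnings;

my $line = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# Assumed formats: hash is lowercase hex, pid is decimal digits, the handle
# contains no ';'. Only the trailing quote is allowed to contain ';'.
my $re = qr/^(\d+:\d+)\ssay;([0-9a-f]+);(\d+);([^;]*);(.*)$/;

my ($ts, $hash, $pid, $handle, $quote) = $line =~ $re;

# A line with too few fields is rejected, and no field can swallow a
# separator while the engine looks for a match.
my $bad = "14:48 say;0ed673079715c343281355c2a1fde843;2";
my $bad_matches = ($bad =~ $re) ? 1 : 0;
```

Because none of the first four groups can cross a semicolon, the engine has very few alternatives to retry when a line does not fit.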
You could make * non-greedy by appending a question mark:
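A toy illustration of the difference, using a made-up string rather than the line from the question ($text, $greedy and $lazy are illustrative names):

```perl
use strict;
use warnings;

my $text = "a;b;c";

# Greedy: .* runs to the end of the string, then backs off to the LAST ';'.
my ($greedy) = $text =~ /^(.*);/;     # captures "a;b"

# Non-greedy: .*? grows one character at a time and stops at the FIRST ';'.
my ($lazy) = $text =~ /^(.*?);/;      # captures "a"

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```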
or you can match everything except a semicolon in each part except the last:
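For the line in the question, that second option could be sketched like this (variables are declared before the if only so they remain visible afterwards):

```perl
use strict;
use warnings;

my $line = "14:48 say;0ed673079715c343281355c2a1fde843;2;laka;hello ;)";

# [^;]* matches any run of characters that are not ';', so each of the
# first three fields ends exactly at its separator. The last field keeps
# the plain greedy (.*) because the quote may itself contain ';'.
my ($ts, $hash, $pid, $handle, $quote);
if ($line =~ /(\d+:\d+)\ssay;([^;]*);([^;]*);([^;]*);(.*)/) {
    ($ts, $hash, $pid, $handle, $quote) = ($1, $2, $3, $4, $5);
}
```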