Regex problem: unable to match a variable-length pattern
I have a regex problem: I'm using preg_match_all() to match something of variable length.
What I am trying to match is the traffic condition after the word 'Congestion'. What I came up with is this regex pattern:
Congestion\s*:\s*(?P<congestion>.*)
However, it extracts everything from the first instance all the way to the end of the entire subject, since .* matches everything. That's not what I want; I would like it to match separately as 3 instances.
Since the words after Congestion can be of variable length, I can't really predict how many words and spaces there are, so I can't come up with a stricter match such as \w*\s*\w*.
Any clues on how I can proceed from here?
Highway : Highway 26
Datetime : 18-Oct-2010 05:18 PM
Congestion : Traffic is slow from Smith St to Alice Springs St
Highway : Princes Highway
Datetime : 18-Oct-2010 05:18 PM
Congestion : Traffic is slow at the Flinders St / Elizabeth St intersection
Highway : Eastern Freeway
Datetime : 18-Oct-2010 05:19 PM
Congestion : Traffic is slow from Prince St to Queen St
EDIT FOR CLARITY
The nicely formatted text shown here is actually received via a very poorly formatted HTML email. It contains random line breaks here and there, e.g. "Congestion : Traffic\n is slow from Prince\nSt to Queen St".
So while processing the emails, I strip out all the HTML and the random line breaks, and json_encode() the result into one very long single-line string with no line break...
By default, PCRE treats the subject string as a single line: ^ and $ match only at the very start and end of the string. You can use the "m" (PCRE_MULTILINE) flag to change that behaviour, so that ^ and $ also match at the start and end of each line. Then you can tell PHP to match only to the end of the line. There are two things to notice in such a pattern: first, it includes line-begin (^) and line-end ($) markers. Secondly, it carries the m modifier.
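The multiline approach described above could look like the following sketch. It assumes the line breaks between fields are still present in the subject, which is not the case for the flattened single-line string mentioned in the question's edit; the sample $subject is hypothetical and built from the question's data.

```php
<?php
// Hypothetical input: one field per line, line breaks preserved.
$subject = "Highway : Highway 26\n"
         . "Datetime : 18-Oct-2010 05:18 PM\n"
         . "Congestion : Traffic is slow from Smith St to Alice Springs St\n"
         . "Highway : Princes Highway\n"
         . "Datetime : 18-Oct-2010 05:18 PM\n"
         . "Congestion : Traffic is slow at the Flinders St / Elizabeth St intersection\n";

// With the trailing m modifier, ^ and $ anchor at line boundaries.
// The dot does not match "\n" by default, so .* stops at the end of the line.
$pattern = '/^Congestion\s*:\s*(?P<congestion>.*)$/m';

preg_match_all($pattern, $subject, $matches);

// One entry per Congestion line.
print_r($matches['congestion']);
```

If the line breaks have already been stripped, this pattern captures everything after the first "Congestion :" again, so it only helps when the matching is done before flattening the text.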
You can try a minimal match:
Congestion\s*:\s*(?P<congestion>.*?)
On its own, however, the lazy .*? is satisfied with zero characters, so the named group 'congestion' comes back empty unless the pattern goes on to match something immediately after the congestion text.
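A minimal sketch of that behaviour, using one line of the question's sample data as a hypothetical subject:

```php
<?php
$subject = 'Congestion : Traffic is slow from Prince St to Queen St';

// Nothing follows the lazy group, so .*? stops as early as it can: at zero characters.
preg_match('/Congestion\s*:\s*(?P<congestion>.*?)/', $subject, $m);

var_dump($m['congestion']); // string(0) ""
```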
So, this could be fixed if "Highway" always starts the traffic condition records:
Congestion\s*:\s*(?P<congestion>.*?)Highway\s*:
If this works (I have not checked it), every record except the last one is matched: the last record has no following 'Highway :' to terminate it. This can easily be fixed by appending the text 'Highway :' to the end of the input string.
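As an alternative to appending 'Highway :' to the input, a lookahead that also accepts the end of the string might work. This is a sketch rather than the answerer's pattern: the lookahead and the s modifier are additions, and the flattened $subject is an assumption about what the stripped email looks like.

```php
<?php
// Sample data from the question, flattened to a single line (assumed shape).
$subject = 'Highway : Highway 26 Datetime : 18-Oct-2010 05:18 PM '
         . 'Congestion : Traffic is slow from Smith St to Alice Springs St '
         . 'Highway : Princes Highway Datetime : 18-Oct-2010 05:18 PM '
         . 'Congestion : Traffic is slow at the Flinders St / Elizabeth St intersection '
         . 'Highway : Eastern Freeway Datetime : 18-Oct-2010 05:19 PM '
         . 'Congestion : Traffic is slow from Prince St to Queen St';

// The lazy group grows until the lookahead sees either the next "Highway :"
// or the end of the subject. The lookahead consumes nothing, so the next
// record stays intact. The s modifier lets the dot cross any stray newlines.
$pattern = '/Congestion\s*:\s*(?P<congestion>.*?)\s*(?=Highway\s*:|$)/s';

preg_match_all($pattern, $subject, $matches);

print_r($matches['congestion']);
```

This relies on the same assumption as the answer above: "Highway :" always starts the next record and never appears inside a congestion description.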