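I'm having a bit of a headache with a regex for links whose href can use different delimiters (" and ').
So, I want to match the following link structures with preg_match_all in PHP: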
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
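I can get the " and ' delimited URLs by doing the following:
'#<a[^>]*?href=("|\')(.*?)("|\')#is'
or I can get all three forms, but not if there are spaces in the first two:
'#<a[^>]*?href=("|\')?(.*?)[\s\"\'>]#is'
How can I formulate this so that it picks up " and ' delimited URLs that may contain spaces, but also properly encoded URLs without delimiters?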
OK, this seems to work:
($matches[1] contains the URLs)
The only annoyance is that quoted URLs still have their quotes on, so you'll have to strip them off:
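As a rough sketch of this approach (an assumption about its general shape, not necessarily the exact pattern posted in this answer): capture the whole attribute value, quotes included, with one alternative per delimiter style, then trim the quotes afterwards. The `$html` sample comes from the question; the variable names `$pattern` and `$urls` are illustrative.

```php
<?php
// Sample input taken from the question; $html is an illustrative name.
$html = <<<'HTML'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
HTML;

// One capture group with three alternatives: "...", '...', or a bare
// run of characters that stops at whitespace or '>'.
$pattern = '#<a\b[^>]*?\shref\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+)#is';
preg_match_all($pattern, $html, $matches);

// $matches[1] holds the values with any quotes still attached; strip them.
$urls = array();
foreach ($matches[1] as $value) {
    $urls[] = trim($value, "\"'");
}

print_r($urls);
```

Because the quoted alternatives are tried first, spaces inside a quoted value are kept, while an unquoted value ends at the first space.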
EDIT: I have edited this to work a little better than what I originally posted.
You almost have it in the second regex:
Returns the following array:
Works with or without delimiters.
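One way to adapt the question's second regex along these lines, offered only as a hedged sketch and not as this answer's actual pattern, is to capture the optional opening quote and use a PCRE conditional so the value must end with that same quote if one was opened, or at whitespace or `>` otherwise. The `$html` sample is from the question; `$pattern` is an illustrative name.

```php
<?php
// Sample input taken from the question; $html is an illustrative name.
$html = <<<'HTML'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
HTML;

// Group 1 optionally captures the opening quote. The conditional
// (?(1)\1|[\s>]) requires the same quote to close the value if one
// was opened, and otherwise stops at whitespace or '>'.
$pattern = '#<a\b[^>]*?\shref=("|\')?(.*?)(?(1)\1|[\s>])#is';
preg_match_all($pattern, $html, $matches);

// $matches[2] holds the URLs without their delimiters.
print_r($matches[2]);
```

Here the URLs land in `$matches[2]` with the quotes already excluded, so no trimming step is needed.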
Use a DOM parser. You cannot parse (X)HTML with regular expressions.
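A minimal sketch of the DOM route using PHP's bundled `DOMDocument`; the helper name `extract_hrefs` is hypothetical, and how the parser recovers from badly broken markup is not guaranteed to match what a browser does.

```php
<?php
// Hypothetical helper: returns the href value of every <a> element.
function extract_hrefs($html)
{
    $doc = new DOMDocument();

    // Collect parse warnings internally instead of emitting them,
    // since real-world markup is rarely valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = array();
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) {
            $hrefs[] = $a->getAttribute('href');
        }
    }
    return $hrefs;
}

$sample = '<p><a class="x" href="http://example.com/?q=a b">one</a> '
        . "<a href='http://example.org/'>two</a></p>";
print_r(extract_hrefs($sample));
```

The parser deals with the quoting styles itself, so the caller never has to care which delimiter a given link used.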
When you say you want to match them, are you trying to extract information out of the links, or simply find hyperlinks with an href? If you're only after the latter, this should work just fine:
As @JasonWoof indicated, you need to use an embedded alternation: one alternative for quoted URLs, one for non-quoted. I also recommend using a capturing group to determine which kind of quote is being used, as @DanHorrigan did. With the addition of a negative lookahead ((?!\\2)) and possessive quantifiers (*+), you can create a highly robust regex that is also very quick:
See it in action on ideone. (The doubled backslashes are because the regex is written in the form of a PHP heredoc. I'd prefer to use a nowdoc, but ideone is apparently still running PHP 5.2.)
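The following is a sketch of a pattern built from those ingredients, written as a nowdoc (PHP 5.3+) so the backslashes stay single. It is an illustration under assumptions, not the regex from this answer: the group numbering is chosen so that `\2` is the captured quote, matching the `(?!\\2)` fragment above, and everything else (the sample `$html`, the variable names, the exact tag prefix) is made up for the example.

```php
<?php
// Sample input taken from the question; $html is an illustrative name.
$html = <<<'HTML'
<a garbage href="http://this.is.a.link.com/?query=this has invalid spaces" possible garbage>
<a garbage href='http://this.is.a.link.com/?query=this also has has invalid spaces' possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters possible garbage>
<a garbage href=http://this.is.a.link.com/?query=no_spaces_but_no_delimiters>
HTML;

// The x flag allows layout and comments inside the pattern.
$pattern = <<<'RE'
~<a\b[^>]*?\shref\s*=\s*
(                      # group 1: the whole attribute value
  (["'])               # group 2: the opening quote
  ((?:(?!\2).)*+)      # group 3: any characters that are not that quote
  \2                   # the matching closing quote
 |
  ([^\s"'>]++)         # group 4: an unquoted value
)~isx
RE;

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$urls = array();
foreach ($matches as $m) {
    // Group 4 is set and non-empty only for unquoted values.
    $urls[] = (isset($m[4]) && $m[4] !== '') ? $m[4] : $m[3];
}

print_r($urls);
```

The possessive quantifiers stop the engine from backtracking into a value it has already consumed, which is where the speed comes from when a tag fails to match.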