Regex to find some .php files
I'm trying to make an exclusion regex for a crawler. I want to index all the .php files that appear in the /archives/ directory, but not anywhere else. So the regex should match all .php files except those that are in an /archives/ directory (however deeply nested). For example, the crawler will index
www.mysite.com/archives/123qwe/index.php
but not
www.mysite.com/123qwe/index.php
I believe this regex should work: (?<!\/archives\/.*)\.php$
However, I'm not able to use the < character, because I need to submit the regex through a web form that strips < characters from the input, which breaks the regex. Is there another way to write this regex without needing the <?
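Apart from the form stripping the < character, a lookbehind containing .* has variable length, and many regex engines reject that. The crawler's engine is not named in the question, so the sketch below uses Python's re module purely as an illustration; the helper name compiles is hypothetical.

```python
import re

# The pattern from the question (slashes need no escaping in Python).
LOOKBEHIND_PATTERN = r"(?<!/archives/.*)\.php$"


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Python's re requires lookbehind assertions to have a fixed width.
        return False


if __name__ == "__main__":
    print(compiles(LOOKBEHIND_PATTERN))
    print(compiles(r"\.php$"))
```

Other engines may behave differently, so this only shows that the pattern is not portable, not that it fails in the crawler.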
Comments (3)
What about a negative lookahead instead of your negative lookbehind? The regex should match if there is no /magazine/ in the string and the string ends with .php. That's very similar to your approach, but without the <. You can see it in action on Regexr.
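The pattern from this answer is not shown here. As a rough sketch of the negative-lookahead idea it describes, the following assumes the /archives/ path from the question and uses Python's re module for illustration; the names EXCLUDE and is_excluded are hypothetical, and the crawler's own engine may differ in details.

```python
import re

# Assumed exclusion pattern: a URL ending in .php with no /archives/
# segment anywhere in it. Note that it contains no "<" character.
EXCLUDE = re.compile(r"^(?!.*/archives/).*\.php$")


def is_excluded(url: str) -> bool:
    """Return True if the exclusion pattern matches the URL."""
    return EXCLUDE.search(url) is not None


if __name__ == "__main__":
    for url in (
        "www.mysite.com/archives/123qwe/index.php",
        "www.mysite.com/123qwe/index.php",
    ):
        print(url, is_excluded(url))
```

The lookahead is checked once, at the start of the string, so the whole URL is scanned for /archives/ before the rest of the pattern is tried.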
Try this:
Or, more legibly:
When you're creating a regex and you're not sure how to proceed, lookbehinds should never be the first tool you reach for. In fact, I tend to regard them as a last resort. They're just not useful enough to offset the complexity they introduce.
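The patterns from this answer are not shown here, so the following is only one possible lookbehind-free form and not necessarily what the answerer wrote. It checks a negative lookahead before each character, and is written with Python's re.VERBOSE flag so that each part can carry a comment. The names EXCLUDE_VERBOSE and is_excluded are hypothetical.

```python
import re

# Assumed lookbehind-free exclusion pattern, spelled out in verbose mode.
EXCLUDE_VERBOSE = re.compile(
    r"""
    ^                        # start of the URL
    (?: (?!/archives/) . )*  # consume characters while /archives/ does not start here
    \.php $                  # the URL must end in .php
    """,
    re.VERBOSE,
)


def is_excluded(url: str) -> bool:
    """Return True if the exclusion pattern matches the URL."""
    return EXCLUDE_VERBOSE.search(url) is not None


if __name__ == "__main__":
    print(is_excluded("www.mysite.com/123qwe/index.php"))
    print(is_excluded("www.mysite.com/archives/123qwe/index.php"))
```

Whether the crawler's form accepts a verbose-mode pattern is unknown; the compact equivalent is ^(?:(?!/archives/).)*\.php$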
Couldn't you just be greedy and specify that you want archive in your regular expression?
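This suggestion only helps if the crawler accepts an inclusion pattern as well as an exclusion pattern, which the question does not say. Under that assumption, a minimal sketch in Python could look like this; INCLUDE and should_index are hypothetical names.

```python
import re

# Assumed inclusion pattern: a .php URL with an /archives/ segment
# somewhere before the file name, at any nesting depth.
INCLUDE = re.compile(r"/archives/.*\.php$")


def should_index(url: str) -> bool:
    """Return True if the URL is a .php file under an /archives/ directory."""
    return INCLUDE.search(url) is not None


if __name__ == "__main__":
    print(should_index("www.mysite.com/archives/123qwe/index.php"))
    print(should_index("www.mysite.com/123qwe/index.php"))
```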