Regex to find certain .php files

Posted 2024-11-03 02:03:39

I'm trying to write an exclusion regex for a crawler. I want to index all the .php files that appear in the /archives/ directory, but none anywhere else. So the exclusion regex should match all .php files except those in an /archives/ directory (however deeply nested). For example, the crawler will index

www.mysite.com/archives/123qwe/index.php

but not

www.mysite.com/123qwe/index.php

I believe this regex should work: (?<!\/archives\/.*)\.php$

However, I can't use the < character, because I need to submit the regex through a web form that strips < characters from the input, which breaks the regex. Is there another way to write this regex without needing the <?

咆哮 2024-11-10 02:03:39

What about

(?!.*\/archives\/)(?:^.*\.php$)

This uses a negative lookahead instead of your negative lookbehind. The regex matches if the string contains no /archives/ and ends with .php.

That's very similar to your approach, but without the <.

You can see it in action on Regexr.
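As a rough check of the lookahead idea, here is a minimal Python sketch. It assumes the crawler treats a successful match as "exclude this URL", and it uses the /archives/ directory named in the question. The helper name `is_excluded` and the third sample URL are illustrative only; the crawler's own regex engine may behave differently from Python's `re` module.

```python
import re

# Lookahead-based exclusion pattern, written for the /archives/ directory
# from the question. It contains no "<" character.
EXCLUDE = re.compile(r"(?!.*\/archives\/)(?:^.*\.php$)")


def is_excluded(url: str) -> bool:
    """Return True when the exclusion pattern matches (hypothetical helper)."""
    return EXCLUDE.search(url) is not None


# Two URLs from the question plus one made-up non-PHP URL.
sample_urls = [
    "www.mysite.com/123qwe/index.php",
    "www.mysite.com/archives/123qwe/index.php",
    "www.mysite.com/archives/index.html",
]

for url in sample_urls:
    print(url, "->", "excluded" if is_excluded(url) else "indexed")
```

Because the `^` anchor sits inside the pattern, the negative lookahead is only effective at the start of the string, where it scans the whole URL for /archives/.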

嘿嘿嘿 2024-11-10 02:03:39

Try this:

^www\.mysite\.com(?:/(?!archives/)[^/.]+)+\.php$

Or, more legibly:

^www\.mysite\.com
(?:
  /               # After consuming the `/`...
  (?!archives/)   # if the next name isn't `archives`...
  [^/.]+          # consume it. 
)+                # Repeat as needed.
\.php$

When you're creating a regex and you're not sure how to proceed, lookbehinds should never be the first tool you reach for. In fact, I tend to regard them as a last resort. They're just not useful enough to offset the complexity they introduce.
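To see how the segment-by-segment pattern behaves, here is a small Python sketch that compiles both the compact form and the commented form (using `re.VERBOSE`) and runs them on a few URLs. The names `COMPACT`, `VERBOSE`, and `is_excluded`, and the last two sample URLs, are assumptions made for illustration; the crawler's regex engine may not support a free-spacing mode at all.

```python
import re

# Compact form of the pattern from this answer.
COMPACT = re.compile(r"^www\.mysite\.com(?:/(?!archives/)[^/.]+)+\.php$")

# The same pattern in free-spacing mode, so the comments can stay inline.
VERBOSE = re.compile(
    r"""
    ^www\.mysite\.com
    (?:
      /               # after consuming the slash...
      (?!archives/)   # if the next name isn't archives...
      [^/.]+          # consume it
    )+                # repeat as needed
    \.php$
    """,
    re.VERBOSE,
)


def is_excluded(url: str, pattern: re.Pattern = COMPACT) -> bool:
    """Return True when the exclusion pattern matches (hypothetical helper)."""
    return pattern.search(url) is not None


sample_urls = [
    "www.mysite.com/123qwe/index.php",           # from the question
    "www.mysite.com/archives/123qwe/index.php",  # from the question
    "www.mysite.com/a/archives/b/index.php",     # made-up: nested archives
    "www.mysite.com/v1.2/index.php",             # made-up: dot in a directory name
]

for url in sample_urls:
    print(url, is_excluded(url, COMPACT), is_excluded(url, VERBOSE))
```

One thing to watch: because each path segment is matched with `[^/.]+`, a directory name containing a dot (the last sample) is not matched, so such a URL would not be excluded even though it lies outside /archives/.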
