How do I get the inverse of a regular expression?

Posted on 2024-07-27 22:22:29

Let's say I have a regular expression that works correctly to find all of the URLs in a text file:

(http://)([a-zA-Z0-9\/\.])*

If what I want is not the URLs but the inverse - all other text except the URLs - is there an easy modification to make to get this?

莳間冲淡了誓言ζ 2024-08-03 22:22:30

If for some reason you need a regex-only solution, try this:

((?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))|\A(?!http://[a-zA-Z0-9\/\.#?/%])).+?((?=http://[a-zA-Z0-9\/\.#?/%])|\Z)

I expanded the set of URL characters a little ([a-zA-Z0-9\/\.#?/%]) to include a few important ones, but this is by no means meant to be exact or exhaustive.

The regex is a bit of a monster, so I'll try to break it down:

(?<=http://[a-zA-Z0-9\/\.#?/%]+(?=[^a-zA-Z0-9\/\.#?/%]))

The first portion matches the end of a URL. http://[a-zA-Z0-9\/\.#?/%]+ matches the URL itself, while (?=[^a-zA-Z0-9\/\.#?/%]) asserts that the URL must be followed by a non-URL character so that we are sure we are at the end. A lookahead is used so that the non-URL character is sought but not captured. The whole thing is wrapped in a lookbehind (?<=...) to look for it as the boundary of the match, again without capturing that portion.

We also want to match a non-URL at the beginning of the file. \A(?!http://[a-zA-Z0-9\/\.#?/%]) matches the beginning of the file (\A), followed by a negative lookahead to make sure there's not a URL lurking at the start of the file. (This URL check is simpler than the first one because we only need the beginning of the URL, not the whole thing.)
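To make the start-of-file check concrete, here is a minimal Python sketch of just that fragment. The character class is copied from the answer; the helper name `starts_with_non_url` is hypothetical and only there for illustration.

```python
import re

# \A anchors at the very start; the negative lookahead rejects inputs
# that open with "http://" followed by at least one URL character.
STARTS_WITHOUT_URL = re.compile(r"\A(?!http://[a-zA-Z0-9\/\.#?/%])")


def starts_with_non_url(text: str) -> bool:
    """Return True when the text does not open with something URL-like."""
    return STARTS_WITHOUT_URL.search(text) is not None
```

Because both `\A` and the lookahead are zero-width, a successful match consumes no characters; it only confirms that a non-URL run may begin at position 0.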

Both of those checks are put in parentheses and OR'd together with the | character. After that, .+? matches the string we are trying to capture.

Then we come to ((?=http://[a-zA-Z0-9\/\.#?/%])|\Z). Here, we check for the beginning of a URL, once again with (?=http://[a-zA-Z0-9\/\.#?/%]). The end of the file is also a pretty good sign that we've reached the end of our match, so we should look for that, too, using \Z. As with the first big group, we wrap it in parentheses and OR the two possibilities together.
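The terminating half of the pattern can be tried on its own. The sketch below, in Python, pairs the lazy `.+?` with the "URL start or end of string" lookahead; the function name `text_before_first_url` is made up for this example, and it assumes the input does not itself begin with a URL (that case is what the start-of-match group handles).

```python
import re

URL_START = r"http://[a-zA-Z0-9\/\.#?/%]"
# Lazy run that stops right before a URL start or at the end of the string.
UP_TO_URL = re.compile(r".+?(?=" + URL_START + r"|\Z)")


def text_before_first_url(text: str) -> str:
    """Return the leading non-URL text, or "" if nothing matches."""
    match = UP_TO_URL.match(text)
    return match.group(0) if match else ""
```

Note that `.` does not cross newlines unless a dot-all flag is set, so multi-line input probably needs `re.DOTALL` (or the equivalent option in your engine).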

The | symbol requires the parentheses because its precedence is very low, so you have to explicitly state the boundaries of the OR.
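A small Python illustration of that precedence rule, using made-up patterns rather than the URL regex: without parentheses the alternation splits the entire pattern, with them it only covers the grouped part.

```python
import re

# Without grouping, | splits the whole pattern: "gray" OR "grey day".
ungrouped = re.compile(r"gray|grey day")
# With grouping, only the alternatives inside the parentheses are OR'd.
grouped = re.compile(r"(gray|grey) day")


def fully_matches(pattern: re.Pattern, text: str) -> bool:
    """Return True when the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None
```

Here `ungrouped` accepts "gray" but not "gray day", while `grouped` accepts "gray day" but not "gray".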

This regex relies heavily on zero-width assertions (the \A and \Z anchors, and the lookaround groups). You should always understand a regex before you use it for anything serious or permanent (otherwise you might catch a case of perl), so you might want to check out "Start of String and End of String Anchors" and "Lookahead and Lookbehind Zero-Width Assertions".

Corrections welcome, of course!
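One caveat, offered as a sketch rather than a correction: the lookbehind above contains a `+`, which makes it variable-width. Engines such as .NET accept that, but Python's built-in `re` module rejects it. If a regex-only solution is not a hard requirement, splitting on the URL pattern gives the same non-URL pieces. The helper names below are hypothetical.

```python
import re

# URL pattern using the expanded character class from the answer.
URL = re.compile(r"http://[a-zA-Z0-9\/\.#?/%]+")


def non_url_segments(text: str) -> list[str]:
    """Split on URLs and keep the non-empty pieces in between."""
    return [piece for piece in URL.split(text) if piece]


def stdlib_accepts(pattern: str) -> bool:
    """Report whether Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True
```

For example, `non_url_segments("a http://x.y/z b")` yields the text on either side of the URL, and `stdlib_accepts` can be used to check up front whether a given lookbehind is usable in this engine.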

锦上情书 2024-08-03 22:22:30

If I understand the question correctly, you can use search/replace: just wildcard around your expression and then substitute the first and last parts.

s/^(.*)(your regex here)(.*)$/$1$3/
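As a rough Python rendering of this idea (function names are hypothetical, and the URL pattern is the question's, written without capture groups): a literal translation of the substitution removes only one URL per line, the last one, because the leading `.*` is greedy. Substituting the URL pattern itself with an empty string removes every occurrence.

```python
import re

# The question's pattern, without capture groups so \1 and \3 line up.
URL = r"http://[a-zA-Z0-9\/\.]*"


def substitute_once(line: str) -> str:
    """Literal rendering of s/^(.*)(URL)(.*)$/$1$3/ for a single line."""
    return re.sub(r"^(.*)(" + URL + r")(.*)$", r"\1\3", line)


def strip_urls(text: str) -> str:
    """Remove every URL match, leaving all other text in place."""
    return re.sub(URL, "", text)
```

The surrounding whitespace is left untouched, so a removed URL leaves a double space behind; collapse it separately if that matters.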