How do I use a regular expression to ignore strings that contain a specific substring?

Posted on 2024-07-13 00:20:11

How would I go about using a negative lookbehind (or any other method) in a regular expression to ignore strings that contain a specific substring?

I've read two previous Stack Overflow questions:
java-regexp-for-file-filtering
regex-to-match-against-something-that-is-not-a-specific-substring

They are nearly what I want. My problem is that the string doesn't end with what I want to ignore. If it did, this would not be a problem.

I have a feeling this has to do with the fact that lookarounds are zero-width and something is matching on the second pass through the string, but I'm not too sure of the internals.

Anyway, if anyone is willing to take the time to explain it, I would greatly appreciate it.

Here is an example of an input string that I want to ignore:

192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "GET /FOO/BAR/ HTTP/1.1" 200 2246

Here is an example of an input string that I want to keep for further evaluation:

192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "GET /FOO/BAR/content.js HTTP/1.1" 200 2246

The key for me is that I want to ignore any HTTP GET that requests a document root default page, that is, a path ending in a slash.

Following is my little test harness and the best regex I've come up with so far.

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class RegexTest {
    public static void main(String[] args) {
        // Active input: a request for a directory's default page (should be ignored)
        String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        //String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";
        //String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/"; // This works
        //String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/"; // This works

        // Best attempt so far: a negative lookbehind placed after the end anchor
        String inRegEx = "^.*(?:GET).*$(?<!.?/ HTTP/)";
        try {
            Pattern pattern = Pattern.compile(inRegEx);
            Matcher matcher = pattern.matcher(inString);

            if (matcher.find()) {
                System.out.printf("I found the text \"%s\" starting at "
                        + "index %d and ending at index %d.%n",
                        matcher.group(), matcher.start(), matcher.end());
            } else {
                System.out.printf("No match found.%n");
            }
        } catch (PatternSyntaxException pse) {
            System.out.println("Invalid RegEx: " + inRegEx);
            pse.printStackTrace();
        }
    }
}
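
One likely reading of why the pattern above matches both full log lines: the lookbehind `(?<!.?/ HTTP/)` comes after `$`, so it is only evaluated at the end of the input. There the preceding characters are `" 200 2246`, not `/ HTTP/`, so the assertion always succeeds. It only rejects the truncated test strings because those really do end in `/ HTTP/`. A minimal sketch of one alternative, assuming a negative lookahead anchored at the start of the line is acceptable; the class name `LookaheadFilter` and the method `keep` are hypothetical names chosen for illustration:

```java
import java.util.regex.Pattern;

public class LookaheadFilter {
    // (?!.*/ HTTP/) fails at position 0 if "/ HTTP/" occurs anywhere in the line;
    // the rest of the pattern then requires "GET" somewhere in the line.
    private static final Pattern KEEP = Pattern.compile("^(?!.*/ HTTP/).*GET.*$");

    // Returns true when the line should be kept for further evaluation.
    public static boolean keep(String line) {
        return KEEP.matcher(line).find();
    }

    public static void main(String[] args) {
        String dirRequest = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "
                + "\"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        String fileRequest = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "
                + "\"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";

        System.out.println(keep(dirRequest));  // false
        System.out.println(keep(fileRequest)); // true
    }
}
```

Because the lookahead scans forward from the start, it does not matter that the line continues after `HTTP/1.1"`.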


Comments (4)

江湖正好 2024-07-20 00:20:11

Could you just match any path that doesn't end with a /?

String inRegEx = "^.* \"GET (.*[^/]) HTTP/.*$";

This can also be done using a negative lookbehind:

String inRegEx = "^.* \"GET (.+)(?<!/) HTTP/.*$";

Here, (?<!/) says "the preceding sequence must not match /".
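
As a quick check of the two patterns above, here is a small sketch that runs both against the sample log lines from the question. The class name `PathSuffixDemo` and the helper `path` are hypothetical; only the two regex strings come from this answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathSuffixDemo {
    // Character-class version: the last character of the path must not be '/'.
    public static final Pattern CHAR_CLASS =
            Pattern.compile("^.* \"GET (.*[^/]) HTTP/.*$");
    // Lookbehind version: the position just before " HTTP/" must not follow a '/'.
    public static final Pattern LOOKBEHIND =
            Pattern.compile("^.* \"GET (.+)(?<!/) HTTP/.*$");

    // Returns the captured request path, or null when the line does not match.
    public static String path(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String dirRequest = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "
                + "\"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        String fileRequest = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "
                + "\"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";

        System.out.println(path(CHAR_CLASS, dirRequest));  // null
        System.out.println(path(CHAR_CLASS, fileRequest)); // /FOO/BAR/content.js
        System.out.println(path(LOOKBEHIND, dirRequest));  // null
        System.out.println(path(LOOKBEHIND, fileRequest)); // /FOO/BAR/content.js
    }
}
```

Both versions work here because the assertion sits directly before ` HTTP/`, the spot where the unwanted slash would appear, rather than at the end of the line.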

丑丑阿 2024-07-20 00:20:11

Maybe I'm missing something here, but couldn't you just go without any regular expression and ignore anything for which this is true:

string.contains("/ HTTP")

Because a file path will never end with a slash.
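
A minimal sketch of that regex-free approach, assuming the log lines are available as a list of strings. The class name `ContainsFilter` and its methods are hypothetical names for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

public class ContainsFilter {
    // True when the request path ends in a slash, i.e. the line contains "/ HTTP".
    public static boolean isDirectoryRequest(String line) {
        return line.contains("/ HTTP");
    }

    // Keeps only the lines that are not directory requests.
    public static List<String> keepFileRequests(List<String> lines) {
        return lines.stream()
                .filter(line -> !isDirectoryRequest(line))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/1.1\" 200 2246",
                "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246");

        // Prints only the content.js line.
        keepFileRequests(lines).forEach(System.out::println);
    }
}
```

This relies on `/ HTTP` not appearing anywhere else in the line. A log format that also records a referrer or user agent might need the stricter regex-based check.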

软甜啾 2024-07-20 00:20:11

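I would use something like this:

"\"GET /FOO/BAR/[^ ]+ HTTP/1\\.[01]\""

This matches every request whose path is not just /FOO/BAR/, since [^ ]+ requires at least one more character after the final slash. (In a Java string literal the dot has to be escaped as \\. rather than \.)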
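
A short sketch of how this pattern might be applied, with the dot written as `\\.` so the Java string literal compiles. The class name `FooBarFilter` and the method `isFileUnderFooBar` are hypothetical; the `/FOO/BAR/` prefix is taken from the sample log lines.

```java
import java.util.regex.Pattern;

public class FooBarFilter {
    // Requires at least one non-space character after "/FOO/BAR/" and an
    // HTTP/1.0 or HTTP/1.1 version, all inside the quoted request field.
    private static final Pattern FILE_REQUEST =
            Pattern.compile("\"GET /FOO/BAR/[^ ]+ HTTP/1\\.[01]\"");

    // find() is used because the pattern describes only part of the log line.
    public static boolean isFileUnderFooBar(String line) {
        return FILE_REQUEST.matcher(line).find();
    }

    public static void main(String[] args) {
        String dirRequest = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "
                + "\"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        String fileRequest = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "
                + "\"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";

        System.out.println(isFileUnderFooBar(dirRequest));  // false
        System.out.println(isFileUnderFooBar(fileRequest)); // true
    }
}
```

Note that this variant is tied to one directory. A path such as `/FOO/BAR/sub/` would still match, because `[^ ]+` accepts slashes.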

情深已缘浅 2024-07-20 00:20:11

If you are writing regexes this complex, I would recommend building a library of resources outside of Stack Overflow.
