How do I use a regular expression to ignore strings that contain a specific substring?
How would I go about using a negative lookbehind (or any other method) in a regular expression to ignore strings that contain a specific substring?
I've read two previous Stack Overflow questions:
java-regexp-for-file-filtering
regex-to-match-against-something-that-is-not-a-specific-substring
They are nearly what I want, but my problem is that the string doesn't end with what I want to ignore. If it did, this would not be a problem.
I have a feeling this has to do with the fact that lookarounds are zero-width and something is matching on the second pass through the string, but I'm not sure of the internals.
Anyway, if anyone is willing to take the time to explain it, I would greatly appreciate it.
Here is an example of an input string that I want to ignore:
192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "GET /FOO/BAR/ HTTP/1.1" 200 2246
Here is an example of an input string that I want to keep for further evaluation:
192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] "GET /FOO/BAR/content.js HTTP/1.1" 200 2246
The key for me is that I want to ignore any HTTP GET that is going after a document root default page.
Following is my little test harness and the best RegEx I've come up with so far.
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// The class name is an assumed wrapper so the harness compiles on its own.
public class RegexTest {
    public static void main(String[] args) {
        String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        //String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";
        //String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/"; // This works
        //String inString = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/"; // This works
        String inRegEx = "^.*(?:GET).*$(?<!.?/ HTTP/)";
        try {
            Pattern pattern = Pattern.compile(inRegEx);
            Matcher matcher = pattern.matcher(inString);
            if (matcher.find()) {
                System.out.printf("I found the text \"%s\" starting at " +
                        "index %d and ending at index %d.%n",
                        matcher.group(), matcher.start(), matcher.end());
            } else {
                System.out.printf("No match found.%n");
            }
        } catch (PatternSyntaxException pse) {
            System.out.println("Invalid RegEx: " + inRegEx);
            pse.printStackTrace();
        }
    }
}
Comments (4)
Could you just match any path that doesn't end with a / ?
This can also be done using a negative lookbehind.
Here, (?<!/) says "the preceding sequence must not match /".
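The lookbehind in the question is checked at the very end of the line (after `$`), where the preceding text is `200 2246`, so it only rejects lines that literally end in `/ HTTP/`. A minimal sketch of the suggestion above, assuming the lookbehind is instead placed directly after the request path; the class name, method name, and exact pattern are illustrative and not taken from the original answer:

```java
import java.util.regex.Pattern;

// Hypothetical helper; names and pattern are assumptions for illustration.
class LookbehindFilterSketch {
    // "GET ", then the request path (\S*), then a check that the character
    // just before " HTTP/" is not a slash.
    static final Pattern KEEP = Pattern.compile("GET \\S*(?<!/) HTTP/");

    // Returns true when the line should be kept for further evaluation.
    static boolean keep(String logLine) {
        return KEEP.matcher(logLine).find();
    }

    public static void main(String[] args) {
        String root = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        String file = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";
        System.out.println(keep(root)); // false: the path ends with '/'
        System.out.println(keep(file)); // true: the path ends with 's'
    }
}
```

Because the lookbehind sits next to the part of the line it constrains, it no longer depends on what the line ends with.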
Maybe I'm missing something here, but couldn't you just go without any regular expression and ignore anything for which this is true:
Because a file path will never end with a slash.
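The condition this comment refers to did not survive in the page. One possible reading, sketched under the assumption that the log line has the quoted `"GET <path> HTTP/x.y"` form shown in the question (class and method names are hypothetical), is a plain string check on the request path:

```java
// Hypothetical helper; not the commenter's original code.
class NoRegexFilterSketch {
    // Returns true when the quoted request is a GET whose path ends with '/'.
    static boolean isDirectoryRequest(String logLine) {
        int open = logLine.indexOf('"');
        int close = logLine.indexOf('"', open + 1);
        if (open < 0 || close < 0) {
            return false; // no quoted request section found
        }
        // The request section looks like: GET /FOO/BAR/ HTTP/1.1
        String[] parts = logLine.substring(open + 1, close).split(" ");
        return parts.length >= 2
                && parts[0].equals("GET")
                && parts[1].endsWith("/");
    }

    public static void main(String[] args) {
        String root = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        String file = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";
        System.out.println(isDirectoryRequest(root)); // true, so ignore this line
        System.out.println(isDirectoryRequest(file)); // false, so keep this line
    }
}
```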
I would use something like this:
This matches every path that's not just /FOO/BAR/.
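The pattern from this comment is missing from the page. A sketch of one pattern that fits the description, assuming a negative lookahead placed right after `GET ` (the class name, method name, and pattern are illustrative guesses, not the commenter's code):

```java
import java.util.regex.Pattern;

// Hypothetical helper; the pattern is an assumption matching the description.
class LookaheadFilterSketch {
    // After "GET ", refuse to continue if the path is exactly "/FOO/BAR/"
    // (the trailing space in the lookahead marks the end of the path).
    static final Pattern KEEP = Pattern.compile("GET (?!/FOO/BAR/ )\\S+ HTTP/");

    // Returns true when the line requests anything other than /FOO/BAR/ itself.
    static boolean keep(String logLine) {
        return KEEP.matcher(logLine).find();
    }

    public static void main(String[] args) {
        String root = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/ HTTP/1.1\" 200 2246";
        String file = "192.168.1.10 - - [08/Feb/2009:16:33:54 -0800] \"GET /FOO/BAR/content.js HTTP/1.1\" 200 2246";
        System.out.println(keep(root)); // false
        System.out.println(keep(file)); // true
    }
}
```

Note that this variant only excludes the one hard-coded path; a request for a different directory, such as `/OTHER/`, would still be kept.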
If you are writing Regex this complex, I would recommend building a library of resources outside of StackOverflow.