What is a word boundary in regex?

Posted 2024-08-02 13:20:34 · 645 words · 12 views · 0 comments

I'm trying to use regexes to match space-separated numbers.
I can't find a precise definition of \b ("word boundary").
I had assumed that -12 would be an "integer word" (matched by \b\-?\d+\b) but it appears that this does not work. I'd be grateful to know of ways of .

[I am using Java regexes in Java 1.6]

Example:

Pattern pattern = Pattern.compile("\\s*\\b\\-?\\d+\\s*");
String plus = " 12 ";
System.out.println("" + pattern.matcher(plus).matches());

String minus = " -12 ";
System.out.println("" + pattern.matcher(minus).matches());

pattern = Pattern.compile("\\s*\\-?\\d+\\s*");
System.out.println("" + pattern.matcher(minus).matches());

This returns:

true
false
true

与风相奔跑 2024-08-09 13:20:34

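In most regex dialects, a word boundary is a position between \w and \W (a non-word character), or at the beginning or end of a string if it begins or ends (respectively) with a word character ([0-9A-Za-z_]).

So, in the string "-12", it would match before the 1 or after the 2. The dash is not a word character.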
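
To make those positions concrete, here is a minimal Java sketch that lists every index at which \b matches. The class and method names (BoundaryPositions, boundaries) are made up for illustration and are not from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class BoundaryPositions {
    // Returns every index at which the zero-width \b assertion matches.
    static List<Integer> boundaries(String input) {
        List<Integer> positions = new ArrayList<Integer>();
        Matcher m = Pattern.compile("\\b").matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("-12"));   // [1, 3]: before the 1 and after the 2
        System.out.println(boundaries(" -12 ")); // [2, 4]: nothing next to the dash's left side
    }
}
```

No boundary is reported between the space and the dash, which is why a \b placed in front of -? cannot succeed there.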

梦萦几度 2024-08-09 13:20:34

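While learning regular expressions, I was really stuck on the \b metacharacter. I kept asking myself "what is it, what is it" without grasping its meaning. After some experiments on the website, I noticed the pink vertical dashes at the beginning and at the end of every word, and at that moment its meaning clicked. It is exactly a word (\w) boundary.

My view is aimed purely at building intuition. The logic behind it should be checked against the other answers.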

寄人书 2024-08-09 13:20:34

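A word boundary can occur in one of three positions:

Before the first character in the string, if the first character is a word character.
After the last character in the string, if the last character is a word character.
Between two characters in the string, where one is a word character and the other is not a word character.

Word characters are alphanumeric (plus the underscore); a minus sign is not.
Taken from the Regex Tutorial.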
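
The three rules above can be sketched as a hand-written check with no regex at all. This is only an illustration under an ASCII-only assumption for "word character"; the names ManualBoundary, isWordChar and isBoundary are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class ManualBoundary {
    // ASCII-only notion of a word character: [A-Za-z0-9_] (an assumption).
    static boolean isWordChar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9') || c == '_';
    }

    // Position i lies between s[i-1] and s[i]; the string edges count as "no word character".
    static boolean isBoundary(String s, int i) {
        boolean wordBefore = i > 0 && isWordChar(s.charAt(i - 1));
        boolean wordAfter = i < s.length() && isWordChar(s.charAt(i));
        return wordBefore != wordAfter;
    }

    static List<Integer> boundaries(String s) {
        List<Integer> positions = new ArrayList<Integer>();
        for (int i = 0; i <= s.length(); i++) {
            if (isBoundary(s, i)) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        System.out.println(boundaries("-12"));   // [1, 3]
        System.out.println(boundaries("a_b c")); // [0, 3, 4, 5]
    }
}
```

All three rules collapse into one comparison: a position is a boundary exactly when one side is a word character and the other side is not.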

┈┾☆殇 2024-08-09 13:20:34

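I would like to explain Alan Moore's answer.

A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.

Suppose I have the string "This is a cat, and she's awesome", and I want to replace the letter 'a' with 'e' only where that 'a' sits at a word boundary (here, at the start of a word).

In other words, the letter a inside 'cat' should not be replaced.

So I run the regex (in Python) as

re.sub(r"\ba", "e", myString.strip())  # replace a with e

Therefore,

Input; Output

This is a cat, and she's awesome

This is e cat, end she's ewesome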
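
Since the question is about Java, here is a rough Java counterpart of the Python one-liner above. It is a sketch only; the class and method names are invented for this example.

```java
// Hypothetical helper, for illustration only.
final class ReplaceAtBoundary {
    // Replaces every 'a' that directly follows a word boundary (a word-initial 'a') with 'e'.
    static String replaceLeadingA(String s) {
        return s.trim().replaceAll("\\ba", "e");
    }

    public static void main(String[] args) {
        // Prints: This is e cat, end she's ewesome
        System.out.println(replaceLeadingA("This is a cat, and she's awesome"));
    }
}
```

The 'a' in "cat" has a word character on both sides, so no boundary precedes it and it is left alone.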

|煩躁 2024-08-09 13:20:34

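A word boundary is a position that is either preceded by a word character and not followed by one, or followed by a word character and not preceded by one.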
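
That definition translates almost word for word into lookarounds. The sketch below compares the built-in \b with the spelled-out form on a few ASCII inputs; it assumes ASCII text, because the exact set of characters \b treats as word characters differs between Java versions for non-ASCII input. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class LookaroundBoundary {
    static final Pattern BUILT_IN = Pattern.compile("\\b");
    // "preceded by a word char and not followed by one, or the other way round"
    static final Pattern SPELLED_OUT =
            Pattern.compile("(?<=\\w)(?!\\w)|(?<!\\w)(?=\\w)");

    // Collects the start index of every (zero-width) match.
    static List<Integer> positions(Pattern p, String s) {
        List<Integer> result = new ArrayList<Integer>();
        Matcher m = p.matcher(s);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"-12", " -12 ", "C++ and .NET"}) {
            System.out.println(s + " -> " + positions(BUILT_IN, s)
                    + " vs " + positions(SPELLED_OUT, s));
        }
    }
}
```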

星星的轨迹 2024-08-09 13:20:34

I talk about what \b-style regex boundaries actually are here.

The short story is that they’re conditional. Their behavior depends on what they’re next to.

# same as using a \b before:
(?(?=\w) (?<!\w)  | (?<!\W) )

# same as using a \b after:
(?(?<=\w) (?!\w)  | (?!\W)  )

Sometimes that isn’t what you want. See my other answer for elaboration.
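
For the space-separated numbers in the question, one commonly used alternative is to ask for "no non-whitespace character on either side" instead of \b. The sketch below shows that idea in Java; it is an assumption on my part that this is in the spirit of the other answer mentioned above, and the class and method names are made up. As far as I know, java.util.regex does not accept the conditional (?(...)...) syntax shown above, so the sketch uses plain lookarounds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class WhitespaceDelimited {
    // (?<!\S): not preceded by a non-whitespace char (start of input also qualifies).
    // (?!\S):  not followed by a non-whitespace char (end of input also qualifies).
    static final Pattern NUMBER = Pattern.compile("(?<!\\S)-?\\d+(?!\\S)");

    static List<String> numbers(String s) {
        List<String> found = new ArrayList<String>();
        Matcher m = NUMBER.matcher(s);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(numbers(" -12 "));           // [-12]
        System.out.println(numbers("12 -7 x-3 40 5.")); // [12, -7, 40]
    }
}
```

Unlike \b, these assertions do not care whether the neighbouring characters are word characters, so the leading dash is handled like any other part of the token.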

深海少女心 2024-08-09 13:20:34

I ran into an even worse problem when searching text for words like .NET, C++, C#, and C. You would think that computer programmers would know better than to name a language something that is hard to write regular expressions for.

Anyway, this is what I found out (summarized mostly from http://www.regular-expressions.info, which is a great site): In most flavors of regex, characters that are matched by the short-hand character class \w are the characters that are treated as word characters by word boundaries. Java is an exception. Java supports Unicode for \b but not for \w. (I'm sure there was a good reason for it at the time).

The \w stands for "word character". It always matches the ASCII characters [A-Za-z0-9_]. Notice the inclusion of the underscore and digits (but not dash!). In most flavors that support Unicode, \w includes many characters from other scripts. There is a lot of inconsistency about which characters are actually included. Letters and digits from alphabetic scripts and ideographs are generally included. Connector punctuation other than the underscore and numeric symbols that aren't digits may or may not be included. XML Schema and XPath even include all symbols in \w. But Java, JavaScript, and PCRE match only ASCII characters with \w.

Which is why Java-based regex searches for C++, C# or .NET (even when you remember to escape the period and pluses) are screwed by the \b.

Note: I'm not sure what to do about mistakes in text, like when someone doesn't put a space after a period at the end of a sentence. I allowed for it, but I'm not sure that it's necessarily the right thing to do.

Anyway, in Java, if you're searching text for those weird-named languages, you need to replace the \b with before and after whitespace and punctuation designators. For example:

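import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns the lines of the input that contain a match for regexp, joined by newlines.
public static String grep(String regexp, String multiLineStringToSearch) {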
    String result = "";
    String[] lines = multiLineStringToSearch.split("\\n");
    Pattern pattern = Pattern.compile(regexp);
    for (String line : lines) {
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            result = result + "\n" + line;
        }
    }
    return result.trim();
}

Then in your test or main function:

    String beforeWord = "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|^)";   
    String afterWord =  "(\\s|\\.|\\,|\\!|\\?|\\(|\\)|\\'|\\\"|$)";
    String text = "Programming in C, (C++) C#, Java, and .NET.";
    System.out.println("text="+text);
    // Here is where Java word boundaries do not work correctly on "cutesy" computer language names.  
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for .NET="+ grep("\\b\\.NET\\b", text));
    System.out.println("Should find: grep exactly for .NET="+ grep(beforeWord+"\\.NET"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java: grep with word boundary for C#="+ grep("\\bC#\\b", text));
    System.out.println("Should find: grep exactly for C#="+ grep("C#"+afterWord, text));
    System.out.println("Bad word boundary can't find because of Java:grep with word boundary for C++="+ grep("\\bC\\+\\+\\b", text));
    System.out.println("Should find: grep exactly for C++="+ grep(beforeWord+"C\\+\\+"+afterWord, text));

    System.out.println("Should find: grep with word boundary for Java="+ grep("\\bJava\\b", text));
    System.out.println("Should find: grep for case-insensitive java="+ grep("(?i)\\bjava\\b", text));
    System.out.println("Should find: grep with word boundary for C="+ grep("\\bC\\b", text));  // Works Ok for this example, but see below
    // Because of the stupid too-short cutesy name, searches find stuff they shouldn't.
    text = "Worked on C&O (Chesapeake and Ohio) Canal when I was younger; more recently developed in Lisp.";
    System.out.println("text="+text);
    System.out.println("Bad word boundary because of C name: grep with word boundary for C="+ grep("\\bC\\b", text));
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));
    // Make sure the first and last cases work OK.

    text = "C is a language that should have been named differently.";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    text = "One language that should have been named differently is C";
    System.out.println("text="+text);
    System.out.println("grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

    //Make sure we don't get false positives
    text = "The letter 'c' can be hard as in Cat, or soft as in Cindy. Computer languages should not require disambiguation (e.g. Ruby, Python vs. Fortran, Hadoop)";
    System.out.println("text="+text);
    System.out.println("Should be blank: grep exactly for C="+ grep(beforeWord+"C"+afterWord, text));

P.S. My thanks to http://regexpal.com/ without whom the regex world would be very miserable!
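
A variation on the same idea, offered only as a sketch: the delimiter sets can be written as lookarounds so that the surrounding space or punctuation is checked but not consumed by the match. The delimiter list mirrors beforeWord/afterWord above; DelimitedTerm and contains are hypothetical names, and Pattern.quote is used so that '.', '+' and similar characters in the search term need no manual escaping.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class DelimitedTerm {
    // Same delimiters as beforeWord/afterWord: whitespace . , ! ? ( ) ' "
    private static final String DELIMS = "\\s.,!?()'\"";

    // True if term occurs with a delimiter (or the string edge) on both sides.
    static boolean contains(String term, String text) {
        Pattern p = Pattern.compile(
                "(?<![^" + DELIMS + "])"   // previous char, if any, must be a delimiter
                + Pattern.quote(term)       // the term itself, taken literally
                + "(?![^" + DELIMS + "])"); // next char, if any, must be a delimiter
        return p.matcher(text).find();
    }

    public static void main(String[] args) {
        String text = "Programming in C, (C++) C#, Java, and .NET.";
        System.out.println(contains("C++", text));  // true
        System.out.println(contains(".NET", text)); // true
        System.out.println(contains("C", "Worked on C&O (Chesapeake and Ohio) Canal")); // false
    }
}
```

The negated-class-inside-negative-lookaround form covers the start and end of the string without a separate ^ or $ alternative.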

说不完的你爱 2024-08-09 13:20:34

Reference: Mastering Regular Expressions (Jeffrey E.F. Friedl), O'Reilly

\b is equivalent to (?

掌心的温暖 2024-08-09 13:20:34

Check out the documentation on boundary conditions:

http://java.sun.com/docs/books/tutorial/essential/regex/bounds.html

Check out this sample:

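import java.util.Arrays;

public class SplitDemo { // wrapper class name chosen only so the snippet compiles
    public static void main(final String[] args) {
        String x = "I found the value -12 in my string.";
        // Split on every match of the pattern and print the leftover pieces.
        System.err.println(Arrays.toString(x.split("\\b-?\\d+\\b")));
    }
}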

When you print it out, notice that the output is this:

[I found the value -,  in my string.]

This means that the "-" character is not being picked up as being on the boundary of a word because it's not considered a word character. Looks like @brianary kinda beat me to the punch, so he gets an up-vote.

黑色毁心梦 2024-08-09 13:20:34

The word boundary \b matches where the character on one side is a word character and the character on the other side is a non-word character.
The regular expression for a negative number should be

--?\b\d+\b

check working DEMO
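
A small Java sketch of how that pattern behaves. As written it requires a leading minus, so it only finds negative numbers; the second pattern, -?\b\d+\b, is my own variation (an assumption, not part of the answer) that makes the minus optional. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class SignedNumber {
    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The minus sits in front of \b, so the boundary is tested between '-' and '1'.
        System.out.println(firstMatch("--?\\b\\d+\\b", " -12 ")); // -12
        System.out.println(firstMatch("--?\\b\\d+\\b", " 12 "));  // null
        // Variation with an optional minus (assumption, not from the answer).
        System.out.println(firstMatch("-?\\b\\d+\\b", " 12 "));   // 12
        System.out.println(firstMatch("-?\\b\\d+\\b", " -12 "));  // -12
    }
}
```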

稚然 2024-08-09 13:20:34

I believe that your problem is due to the fact that - is not a word character. Thus, the word boundary matches after the -, so the - is not captured. Word boundaries match before the first and after the last word character in a string, as well as at any position where a word character is on one side and a non-word character is on the other. Also note that a word boundary is a zero-width match.

One possible alternative is

(?:(?:^|\s)-?)\d+\b

This will match any number that follows a whitespace character and an optional dash and that ends at a word boundary (the whitespace character is part of the match). It will also match a number at the very beginning of the string.
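
Because the leading whitespace is consumed by that pattern, a capturing group around the sign and digits is one way to pull out just the number. This is a sketch built on the pattern above; the added group and the names SpaceSeparatedNumbers and extract are my own additions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class SpaceSeparatedNumbers {
    // Same shape as the pattern above, with group 1 around the optional dash and digits.
    static final Pattern NUMBER = Pattern.compile("(?:^|\\s)(-?\\d+)\\b");

    static List<String> extract(String s) {
        List<String> found = new ArrayList<String>();
        Matcher m = NUMBER.matcher(s);
        while (m.find()) {
            found.add(m.group(1)); // group 1 excludes the leading whitespace
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(extract(" -12 "));        // [-12]
        System.out.println(extract("12 -7 x-3 40")); // [12, -7, 40]
    }
}
```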
