Whitespace Matching Regex - Java

Posted 2024-10-12 09:44:36

The Java API for regular expressions states that \s will match whitespace. So the regex \\s\\s should match two spaces.

Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);
while (matcher.find()) matcher.replaceAll(" ");

The aim of this is to replace every instance of two consecutive whitespace characters with a single space. However, this does not actually work.

Am I having a grave misunderstanding of regexes or the term "whitespace"?

Comments (11)

苯莒 2024-10-19 09:44:36

You can’t use \s in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property — even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.

Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ GeneralCategory=Separator, and the remaining 6 are \p{Cc} GeneralCategory=Control.

White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this:

String whitespace_chars =  ""       /* dummy empty string for homogeneity */
                        + "\\u0009" // CHARACTER TABULATION
                        + "\\u000A" // LINE FEED (LF)
                        + "\\u000B" // LINE TABULATION
                        + "\\u000C" // FORM FEED (FF)
                        + "\\u000D" // CARRIAGE RETURN (CR)
                        + "\\u0020" // SPACE
                        + "\\u0085" // NEXT LINE (NEL) 
                        + "\\u00A0" // NO-BREAK SPACE
                        + "\\u1680" // OGHAM SPACE MARK
                        + "\\u180E" // MONGOLIAN VOWEL SEPARATOR
                        + "\\u2000" // EN QUAD 
                        + "\\u2001" // EM QUAD 
                        + "\\u2002" // EN SPACE
                        + "\\u2003" // EM SPACE
                        + "\\u2004" // THREE-PER-EM SPACE
                        + "\\u2005" // FOUR-PER-EM SPACE
                        + "\\u2006" // SIX-PER-EM SPACE
                        + "\\u2007" // FIGURE SPACE
                        + "\\u2008" // PUNCTUATION SPACE
                        + "\\u2009" // THIN SPACE
                        + "\\u200A" // HAIR SPACE
                        + "\\u2028" // LINE SEPARATOR
                        + "\\u2029" // PARAGRAPH SEPARATOR
                        + "\\u202F" // NARROW NO-BREAK SPACE
                        + "\\u205F" // MEDIUM MATHEMATICAL SPACE
                        + "\\u3000" // IDEOGRAPHIC SPACE
                        ;        
/* A \s that actually works for Java’s native character set: Unicode */
String     whitespace_charclass = "["  + whitespace_chars + "]";    
/* A \S that actually works for  Java’s native character set: Unicode */
String not_whitespace_charclass = "[^" + whitespace_chars + "]";

Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.
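As a rough illustration of that last step, the sketch below wires the same 26 escapes into a compiled pattern and collapses each run to one plain space. The class name `UnicodeWhitespace` and the method `collapse` are made up for this example and are not part of the answer's code.

```java
import java.util.regex.Pattern;

final class UnicodeWhitespace {
    // The 26 code points listed above, written as regex escapes
    // (the doubled backslash keeps them as text for the regex engine).
    private static final String WHITESPACE_CHARS =
          "\\u0009\\u000A\\u000B\\u000C\\u000D\\u0020\\u0085\\u00A0"
        + "\\u1680\\u180E\\u2000\\u2001\\u2002\\u2003\\u2004\\u2005"
        + "\\u2006\\u2007\\u2008\\u2009\\u200A\\u2028\\u2029\\u202F"
        + "\\u205F\\u3000";

    // One or more characters from the explicit whitespace class.
    private static final Pattern WHITESPACE_RUN =
        Pattern.compile("[" + WHITESPACE_CHARS + "]+");

    // Hypothetical helper: replaces each run of whitespace with one space.
    static String collapse(String input) {
        return WHITESPACE_RUN.matcher(input).replaceAll(" ");
    }

    public static void main(String[] args) {
        // Mixes a space, a no-break space, an em space and tabs.
        System.out.println("\"" + collapse("a \u00A0\u2003 b\t\tc") + "\"");
    }
}
```

Compiling the pattern once and reusing it avoids rebuilding the character class on every call.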


Sorry ’bout all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.

And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!

Yes, it’s possible, and yes, it’s a mindnumbing mess. That’s being charitable, even. The easiest way to get a standards-conforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.

If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them to conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.

自在安然 2024-10-19 09:44:36

Yeah, you need to grab the result of matcher.replaceAll():

String result = matcher.replaceAll(" ");
System.out.println(result);
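
For context, here is a minimal self-contained sketch of the question's snippet with that fix applied. Strings are immutable, so `replaceAll()` returns a new string rather than changing `modLine`. The class and method names (`CollapsePairs`, `replacePairsOnce`) are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CollapsePairs {
    private static final Pattern TWO_WHITESPACE = Pattern.compile("\\s\\s");

    // One pass: every non-overlapping pair of whitespace characters
    // becomes a single space.
    static String replacePairsOnce(String line) {
        Matcher matcher = TWO_WHITESPACE.matcher(line);
        // replaceAll() scans the whole input by itself, so no find() loop
        // is needed; the returned string has to be kept by the caller.
        return matcher.replaceAll(" ");
    }

    public static void main(String[] args) {
        String modLine = "a  b    c";
        modLine = replacePairsOnce(modLine); // store the result back
        System.out.println("\"" + modLine + "\"");
    }
}
```

Note that a single pass only halves each run: four spaces become two. Using `\\s{2,}` or `\\s+` as the pattern collapses a whole run in one go.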

零度° 2024-10-19 09:44:36

For Java (not PHP, not JavaScript, not any other language):

txt.replaceAll("\\p{javaSpaceChar}{2,}"," ")
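
A small sketch of how that one-liner behaves, with an invented wrapper class `JavaSpaceCharDemo`. `\p{javaSpaceChar}` follows `Character.isSpaceChar()`, which covers the Unicode separator categories (including the no-break space) but not control characters such as tab or line feed, so tabs are left alone here.

```java
final class JavaSpaceCharDemo {
    // Replaces each run of two or more "space characters"
    // (as defined by Character.isSpaceChar) with one plain space.
    static String collapse(String txt) {
        return txt.replaceAll("\\p{javaSpaceChar}{2,}", " ");
    }

    public static void main(String[] args) {
        // space + no-break space + space is a run of three space characters
        System.out.println("\"" + collapse("a \u00A0 b") + "\"");
        // tabs are control characters, so this input comes back unchanged
        System.out.println(collapse("a\t\tb").equals("a\t\tb"));
    }
}
```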

毁虫ゝ 2024-10-19 09:44:36

Java has evolved since this issue was first brought up. You can match all manner of Unicode space characters by using the \p{Zs} group (the space-separator category, which does not include tabs or line breaks).

Thus if you wanted to replace one or more exotic spaces with a plain space you could do this:

String txt = "whatever my string is";
String newTxt = txt.replaceAll("\\p{Zs}+", " ");

Also worth knowing: if you've used the trim() string function, you should take a look at the strip(), stripLeading(), and stripTrailing() functions on strings (added in Java 11). They can help you trim off all sorts of squirrelly whitespace characters. For more information on which whitespace is included, see Java's Character.isWhitespace() function.
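
To make the difference concrete, the sketch below (assuming Java 11 or later, with an invented class name `StripDemo`) pads a word with em spaces, which `trim()` leaves in place because it only removes characters up to U+0020, while `strip()` removes them.

```java
final class StripDemo {
    // Replaces each run of Unicode space separators with one plain space.
    static String collapseZs(String txt) {
        return txt.replaceAll("\\p{Zs}+", " ");
    }

    public static void main(String[] args) {
        // em space, space, "hello", space, em space
        String padded = "\u2003 hello \u2003";

        // trim() does not treat the em space as removable
        System.out.println("trim():  length " + padded.trim().length());
        // strip() relies on Character.isWhitespace, which accepts the em space
        System.out.println("strip(): length " + padded.strip().length());

        // no-break space followed by em space collapses to one space
        System.out.println("\"" + collapseZs("a\u00A0\u2003b") + "\"");
    }
}
```

One caveat: `Character.isWhitespace()` deliberately excludes the no-break space, so `strip()` will not remove a leading or trailing U+00A0.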

夏尔 2024-10-19 09:44:36

To match any whitespace character, you can use

Pattern whitespace = Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

The Pattern.UNICODE_CHARACTER_CLASS option "enables the Unicode version of Predefined character classes and POSIX character classes" that are then "in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties".

The same behavior can also be enabled with the (?U) embedded flag expression. For example, if you want to replace/remove all Unicode whitespaces in Java with regex, you can use

String removed = text.replaceAll("(?U)\\s+", ""); // removes all whitespace
String perChar = text.replaceAll("(?U)\\s", "-"); // replaces each single whitespace character with -
String perRun  = text.replaceAll("(?U)\\s+", "-"); // replaces each run of one or more whitespace characters with a single -
String leading = text.replaceAll("(?U)\\G\\s", "-"); // replaces each whitespace character at the start of the string with -

See the Java demo online:

String text = "\u00A0 \u00A0\tStart reading\u00A0here..."; // \u00A0 - non-breaking space
System.out.println("Text: '" + text + "'"); // => Text: '       Start reading here...'
System.out.println(text.replaceAll("(?U)\\s+", "")); // => Startreadinghere...
System.out.println(text.replaceAll("(?U)\\s", "-")); // => ----Start-reading-here...
System.out.println(text.replaceAll("(?U)\\s+", "-")); // => -Start-reading-here...
System.out.println(text.replaceAll("(?U)\\G\\s", "-")); // => ----Start reading here... 
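
For a side-by-side view, this sketch compiles the same `\s+` pattern with and without the flag and runs both over text containing no-break spaces. The class and method names are made up for the example.

```java
import java.util.regex.Pattern;

final class UnicodeFlagDemo {
    // Default \s: only space, tab, line feed, vertical tab, form feed, CR.
    private static final Pattern ASCII_WS = Pattern.compile("\\s+");

    // With the flag, \s follows the Unicode White_Space property.
    private static final Pattern UNICODE_WS =
        Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    static String collapseAscii(String text) {
        return ASCII_WS.matcher(text).replaceAll(" ");
    }

    static String collapseUnicode(String text) {
        return UNICODE_WS.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        // two no-break spaces, then two ordinary spaces
        String text = "a\u00A0\u00A0b  c";
        System.out.println("default: \"" + collapseAscii(text) + "\"");
        System.out.println("unicode: \"" + collapseUnicode(text) + "\"");
    }
}
```

The default pattern only collapses the ordinary spaces; the flagged one also collapses the no-break spaces.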

雪落纷纷 2024-10-19 09:44:36

Seems to work for me:

String s = "  a   b      c";
System.out.println("\""  + s.replaceAll("\\s\\s", " ") + "\"");

will print:

" a  b   c"

I think you intended to do this instead of your code:

Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(s);
// replaceAll() resets the matcher itself and returns the input unchanged
// when nothing matches, so no find() check is needed.
String result = matcher.replaceAll(" ");

System.out.println(result);

热鲨 2024-10-19 09:44:36

When I sent a question to the RegexBuddy (regex developer application) forum, I got a more exact reply to my \s Java question:

"Message author: Jan Goyvaerts

In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but simply one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the exact same characters as [\s\p{Z}]).

... \s\s will match two spaces, if the input is ASCII only. The real problem is with the OP's code, as is pointed out by the accepted answer in that question."
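
A short sketch of the `[\s\p{Z}]` class from the quoted reply, applied to the original task of collapsing runs of two or more whitespace characters. The class name `SeparatorClassDemo` is invented for the example.

```java
final class SeparatorClassDemo {
    // \s covers ASCII whitespace (tabs, line breaks, space);
    // \p{Z} adds the Unicode separator categories.
    static String collapse(String txt) {
        return txt.replaceAll("[\\s\\p{Z}]{2,}", " ");
    }

    public static void main(String[] args) {
        // a tab followed by a no-break space counts as a run of two
        System.out.println("\"" + collapse("a\t\u00A0b") + "\"");
    }
}
```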

神也荒唐 2024-10-19 09:44:36

Pattern whitespace = Pattern.compile("\\s\\s");
Matcher matcher = whitespace.matcher(modLine);

// Each pass halves the runs of whitespace; repeat until no pair is left.
while (matcher.find()) {
    // Update the original text with the result of the replace
    modLine = matcher.replaceAll(" ");
    // Point the matcher at this "new" text before searching again
    matcher.reset(modLine);
}

泅人 2024-10-19 09:44:36

For your purpose you can use this snippet:

import org.apache.commons.lang3.StringUtils;

String normalized = StringUtils.normalizeSpace(string);

This will normalize each run of whitespace to a single space and strip leading and trailing whitespace as well.

String sampleString = "Hello    world!";
// replaceAll() returns a new string, so keep the result
String pairsReplaced = sampleString.replaceAll("\\s{2}", " ");  // replaces each pair of consecutive whitespace characters
String runsReplaced  = sampleString.replaceAll("\\s{2,}", " "); // replaces each run of two or more whitespace characters

眼中杀气 2024-10-19 09:44:36

You can use something simpler, if only plain spaces matter:

String out = in.replaceAll(" {2}", " ");

﹏雨一样淡蓝的深情 2024-10-19 09:44:36

Using whitespace in regular expressions is a pain, but I believe it works. The OP's problem can also be solved using StringTokenizer or the split() method. However, to use a regex (uncomment the println() to see how the matcher breaks up the String), here is some sample code:

import java.util.regex.*;

public class Two21WS {
    private String  str = "";
    private Pattern pattern = Pattern.compile ("\\s{2,}");  // multiple spaces

    public Two21WS (String s) {
            StringBuffer sb = new StringBuffer();
            Matcher matcher = pattern.matcher (s);
            int startNext = 0;
            while (matcher.find (startNext)) {
                    if (startNext == 0)
                            sb.append (s.substring (0, matcher.start()));
                    else
                            sb.append (s.substring (startNext, matcher.start()));
                    sb.append (" ");
                    startNext = matcher.end();
                    //System.out.println ("Start, end = " + matcher.start()+", "+matcher.end() +
                    //                      ", sb: \"" + sb.toString() + "\"");
            }
            sb.append (s.substring (startNext));
            str = sb.toString();
    }

    public String toString () {
            return str;
    }

    public static void main (String[] args) {
            String tester = " a    b      cdef     gh  ij   kl";
            System.out.println ("Initial: \"" + tester + "\"");
            System.out.println ("Two21WS: \"" + new Two21WS(tester) + "\"");
}}

It produces the following (compile with javac and run at the command prompt):

% java Two21WS
Initial: " a    b      cdef     gh  ij   kl"
Two21WS: " a b cdef gh ij kl"
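
The split() route mentioned at the top of this answer could look roughly like the sketch below. The class name `SplitCollapse` is invented, and unlike the regex version it also drops leading and trailing whitespace because of the `trim()` call.

```java
final class SplitCollapse {
    // Splits on runs of whitespace and rejoins the pieces with one space.
    static String collapse(String s) {
        // trim() first, otherwise leading whitespace yields an empty first token
        return String.join(" ", s.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String tester = " a    b      cdef     gh  ij   kl";
        System.out.println("Split:   \"" + collapse(tester) + "\"");
    }
}
```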
