Whitespace Matching Regex - Java
The Java API for regular expressions states that \s will match whitespace. So the regex \\s\\s should match two spaces.
Pattern whitespace = Pattern.compile("\\s\\s");
matcher = whitespace.matcher(modLine);
while (matcher.find()) matcher.replaceAll(" ");
The aim of this is to replace all instances of two consecutive whitespace characters with a single space. However, this does not actually work.
Am I having a grave misunderstanding of regexes or the term "whitespace"?
Comments (11)
You can’t use \s in Java to match white space on its own native character set, because Java doesn’t support the Unicode white space property, even though doing so is strictly required to meet UTS#18’s RL1.2! What it does have is not standards-conforming, alas.
Unicode defines 26 code points as \p{White_Space}: 20 of them are various sorts of \pZ GeneralCategory=Separator, and the remaining 6 are \p{Cc} GeneralCategory=Control.
White space is a pretty stable property, and those same ones have been around virtually forever. Even so, Java has no property that conforms to The Unicode Standard for these, so you instead have to use code like this:
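The listing that belonged here is not preserved on this page. As a minimal sketch of the idea, assuming the 26 code points the answer counts (this includes U+180E, which newer Unicode versions no longer list as White_Space), a hand-built character class might look like the following. The class and method names are illustrative; only whitespace_charclass is a name taken from the answer.

```java
class UnicodeWhitespace {
    // Regex-level escapes for the code points counted as White_Space above.
    // The doubled backslashes keep the escapes intact for the regex engine.
    static final String whitespace_chars =
          "\\u0009-\\u000D"   // TAB, LF, VT, FF, CR
        + "\\u0020"           // SPACE
        + "\\u0085"           // NEXT LINE
        + "\\u00A0"           // NO-BREAK SPACE
        + "\\u1680"           // OGHAM SPACE MARK
        + "\\u180E"           // MONGOLIAN VOWEL SEPARATOR (older Unicode versions only)
        + "\\u2000-\\u200A"   // EN QUAD through HAIR SPACE
        + "\\u2028"           // LINE SEPARATOR
        + "\\u2029"           // PARAGRAPH SEPARATOR
        + "\\u202F"           // NARROW NO-BREAK SPACE
        + "\\u205F"           // MEDIUM MATHEMATICAL SPACE
        + "\\u3000";          // IDEOGRAPHIC SPACE

    static final String whitespace_charclass = "[" + whitespace_chars + "]";

    // Hypothetical helper: collapse each run of white space to one plain space.
    static String collapse(String s) {
        return s.replaceAll(whitespace_charclass + "+", " ");
    }

    public static void main(String[] args) {
        System.out.println(collapse("a\u00A0\u3000 b")); // prints "a b"
    }
}
```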
Now you can use whitespace_charclass + "+" as the pattern in your replaceAll.
Sorry ’bout all that. Java’s regexes just don’t work very well on its own native character set, and so you really have to jump through exotic hoops to make them work.
And if you think white space is bad, you should see what you have to do to get \w and \b to finally behave properly!
Yes, it’s possible, and yes, it’s a mindnumbing mess. That’s being charitable, even. The easiest way to get a standards-conforming regex library for Java is to JNI over to ICU’s stuff. That’s what Google does for Android, because OraSun’s doesn’t measure up.
If you don’t want to do that but still want to stick with Java, I have a front-end regex rewriting library I wrote that “fixes” Java’s patterns, at least to get them to conform to the requirements of RL1.2a in UTS#18, Unicode Regular Expressions.
Yeah, you need to grab the result of matcher.replaceAll():
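The answer's own listing is missing here. A minimal sketch of the point, reusing the pattern and the modLine name from the question (the class and method names are made up for illustration), could be:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GrabReplaceAllResult {
    static String collapsePairs(String modLine) {
        Pattern whitespace = Pattern.compile("\\s\\s");
        Matcher matcher = whitespace.matcher(modLine);
        // replaceAll() scans the whole input by itself and returns a new String.
        // Strings are immutable, so modLine is unchanged and the result must be kept.
        return matcher.replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(collapsePairs("two  spaces")); // prints "two spaces"
    }
}
```

No find() loop is needed. Note that \s\s only replaces pairs, so a run of three spaces still leaves two behind; a pattern such as \s+ or \s{2,} is probably closer to what the question is after.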
For Java (not PHP, not JavaScript, not any other):
Java has evolved since this issue was first brought up. You can match all manner of Unicode space characters by using the \p{Zs} group.
Thus if you wanted to replace one or more exotic spaces with a plain space you could do this:
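The snippet that followed is not preserved on this page. A minimal sketch of what such a replacement might look like, with illustrative class and method names, is:

```java
class ZsSpaces {
    // Replace each run of Unicode space separators (category Zs) with one plain space.
    static String normalizeSpaces(String s) {
        return s.replaceAll("\\p{Zs}+", " ");
    }

    public static void main(String[] args) {
        // NO-BREAK SPACE, EM SPACE and IDEOGRAPHIC SPACE are all in category Zs.
        System.out.println(normalizeSpaces("a\u00A0\u2003\u3000b")); // prints "a b"
    }
}
```

Category Zs covers space separators only. Tabs and line breaks are control characters, so this pattern leaves them alone.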
Also worth knowing: if you've used the trim() string function, you should take a look at the (relatively new) strip(), stripLeading(), and stripTrailing() functions on strings. They can help you trim off all sorts of squirrely white space characters. For more information on what white space is included, see Java's Character.isWhitespace() function.
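To illustrate the difference between trim() and strip(), here is a small sketch. It assumes Java 11 or later, where the strip methods are available; the class name is illustrative.

```java
class TrimVersusStrip {
    public static void main(String[] args) {
        // EM SPACE (U+2003) on both sides of the word.
        String padded = "\u2003hello\u2003";

        // trim() only removes characters up to U+0020, so the EM SPACEs stay.
        System.out.println(padded.trim().length());   // prints 7

        // strip() removes anything Character.isWhitespace() accepts.
        System.out.println(padded.strip().length());  // prints 5

        // Non-breaking spaces are not whitespace by that definition.
        System.out.println(Character.isWhitespace('\u00A0')); // prints false
    }
}
```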
To match any whitespace character, you can use \s with the Pattern.UNICODE_CHARACTER_CLASS option, which "enables the Unicode version of Predefined character classes and POSIX character classes" that are then "in conformance with Unicode Technical Standard #18: Unicode Regular Expression Annex C: Compatibility Properties".
The same behavior can also be enabled with the (?U) embedded flag expression. For example, if you want to replace or remove all Unicode whitespace in Java with a regex, you can put (?U) in front of \s in the pattern.
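The original listing and demo link are not preserved here. A minimal sketch of both forms, with illustrative class and method names, might be:

```java
import java.util.regex.Pattern;

class UnicodeFlagWhitespace {
    // Compile-time flag: \s now follows the Unicode White_Space property.
    static final Pattern WS = Pattern.compile("\\s+", Pattern.UNICODE_CHARACTER_CLASS);

    static String viaFlag(String s) {
        return WS.matcher(s).replaceAll(" ");
    }

    // Same behavior with the embedded (?U) flag inside the pattern.
    static String viaEmbeddedFlag(String s) {
        return s.replaceAll("(?U)\\s+", " ");
    }

    public static void main(String[] args) {
        String input = "a\u00A0\u2003b";
        System.out.println(viaFlag(input));          // prints "a b"
        System.out.println(viaEmbeddedFlag(input));  // prints "a b"
    }
}
```

Without the flag, \s in Java covers only the ASCII whitespace characters, so the same input would come back unchanged.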
Seems to work for me:
will print:
I think you intended to do this instead of your code:
When I sent a question to the RegexBuddy (regex development application) forum, I got a more exact reply to my \s Java question:
"Message author: Jan Goyvaerts
In Java, the shorthands \s, \d, and \w only include ASCII characters. ... This is not a bug in Java, but simply one of the many things you need to be aware of when working with regular expressions. To match all Unicode whitespace as well as line breaks, you can use [\s\p{Z}] in Java. RegexBuddy does not yet support Java-specific properties such as \p{javaSpaceChar} (which matches the exact same characters as [\s\p{Z}]).
... \s\s will match two spaces, if the input is ASCII only. The real problem is with the OP's code, as is pointed out by the accepted answer in that question."
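As a rough sketch of the [\s\p{Z}] suggestion in the quoted reply (class and method names are illustrative), the class can be dropped straight into replaceAll:

```java
class SeparatorOrAsciiWhitespace {
    // \s covers ASCII whitespace (tab, line breaks, space);
    // \p{Z} adds the Unicode separator categories Zs, Zl and Zp.
    static String collapse(String s) {
        return s.replaceAll("[\\s\\p{Z}]+", " ");
    }

    public static void main(String[] args) {
        // NO-BREAK SPACE, a tab, then LINE SEPARATOR between the letters.
        System.out.println(collapse("a\u00A0\t\u2028b")); // prints "a b"
    }
}
```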
For your purpose you can use this snippet:
This will normalize each run of whitespace to a single space and will strip off the leading and trailing whitespace as well.
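The snippet itself is not preserved on this page. One way to get the described behavior, offered as a sketch rather than as the answer's actual code, is to trim first and then collapse the remaining runs:

```java
class NormalizeSpacing {
    // Hypothetical helper: strip the ends, then squeeze inner runs to one space.
    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println("[" + normalize("  a   b \t c  ") + "]"); // prints "[a b c]"
    }
}
```

This handles ASCII whitespace only, which matches what \s and trim() cover by default in Java.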
You can use something simpler:
Use of whitespace in RE is a pain, but I believe they work. The OP's problem can also be solved using StringTokenizer or the split() method. However, to use RE (uncomment the println() to view how the matcher is breaking up the String), here is some sample code:
It produces the following (compile with javac and run at the command prompt):
% java Two21WS
Initial: " a b cdef gh ij kl"
Two21WS: " a b cdef gh ij kl"
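The split() alternative mentioned above is not shown in the answer. A minimal sketch of that route, with illustrative names and with the difference that it also drops leading and trailing whitespace, could be:

```java
class SplitAndJoin {
    // Break the line on runs of whitespace, then glue the pieces back
    // together with exactly one space between them.
    static String squeeze(String s) {
        return String.join(" ", s.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(squeeze("  a  b   c ")); // prints "a b c"
    }
}
```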