Matching Unicode dashes in a Java regular expression?
I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:
private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");
which, if I'm reading the Pattern documentation correctly, should capture any of the Unicode dashes or the ASCII dash, when surrounded on both sides by whitespace. I'm using the pattern as follows:
String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);
No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!
In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its Unicode code point, no?
Sample input:
Study Summary (1 of 10) – Competition
S(83)t(116)u(117)d(100)y(121)
(32)S(83)u(117)m(109)m(109)a(97)r(114)y(121)
(32)((40)1(49) (32)o(111)f(102)
(32)1(49)0(48))(41) (32)–(8211)
(32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?
Comments (1)
You're mixing decimal (8211) and hexadecimal (0x8211). \x and \u both expect a hexadecimal number, so you'd need \u2013 to match the en-dash (decimal 8211) and \u2014 to match the em-dash (decimal 8212), not \u8211 and \u8212 (and \x2D for the normal hyphen, etc.).

But why not simply use the Unicode property "Dash punctuation"?
As a Java string:
"\\s\\p{Pd}\\s"