Matching Unicode Dashes in Java Regular Expressions?

Posted on 2024-09-06 09:16:31

I'm trying to craft a Java regular expression to split strings of the general format "foo - bar" into "foo" and "bar" using Pattern.split(). The "-" character may be one of several dashes: the ASCII '-', the em-dash, the en-dash, etc. I've constructed the following regular expression:

private static final Pattern titleSegmentSeparator = Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");

which, if I'm reading the Pattern documentation correctly, should capture any of the Unicode dashes or the ASCII dash when surrounded on both sides by whitespace. I'm using the pattern as follows:

String[] sectionSegments = titleSegmentSeparator.split(sectionTitle);

No joy. For the sample input below, the dash is not detected, and titleSegmentSeparator.matcher(sectionTitle).find() returns false!

In order to make sure I wasn't missing any unusual character entities, I used System.out to print some debug information. The output is as follows -- each character is followed by the output of (int)char, which should be its Unicode code point, no?

Sample input:

Study Summary (1 of 10) – Competition
S(83)t(116)u(117)d(100)y(121) (32)S(83)u(117)m(109)m(109)a(97)r(114)y(121) (32)((40)1(49) (32)o(111)f(102) (32)1(49)0(48))(41) (32)–(8211) (32)C(67)o(111)m(109)p(112)e(101)t(116)i(105)t(116)i(105)o(111)n(110)
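
For reference, a minimal sketch of the kind of debug printing that would produce output in this shape. The class name CodePointDump, the method dump, and the shortened input are assumptions made for illustration; the original debug code was not posted. Note that (int)char is printed in decimal.

```java
public class CodePointDump {
    // Appends each UTF-16 char followed by its numeric value in decimal,
    // e.g. "a(97)". For characters in the Basic Multilingual Plane this
    // value is the same as the Unicode code point.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(c).append('(').append((int) c).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical, shortened version of the sample title; the dash
        // in the middle is an en dash.
        String sectionTitle = "10) \u2013 C";
        // prints: 1(49)0(48))(41) (32)–(8211) (32)C(67)
        System.out.println(dump(sectionTitle));
    }
}
```

The value 8211 shown for the dash is decimal; in hexadecimal the same value is 2013.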

It looks to me like that dash is codepoint 8211, which should be matched by the regex, but it isn't! What's going on here?
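
One possible explanation, offered as a sketch rather than a confirmed answer: in java.util.regex.Pattern, the escapes written as backslash-x and backslash-u take hexadecimal digits, while the values printed above are decimal. Read as hex, 45 is the letter 'E' and 8211 is a CJK ideograph, so neither matches a dash. The class and field names below are hypothetical. Decimal 8214 appears to correspond to a double vertical line rather than a dash, so it is left out here.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class DashSplit {
    // The pattern from the question, kept unchanged for comparison.
    static final Pattern ORIGINAL =
            Pattern.compile("\\s(\\x45|\\u8211|\\u8212|\\u8213|\\u8214)\\s");

    // Hex forms of decimal 45, 8211, 8212 and 8213:
    // hyphen-minus, en dash, em dash and horizontal bar.
    static final Pattern HEX_DASHES =
            Pattern.compile("\\s(\\x2D|\\u2013|\\u2014|\\u2015)\\s");

    // Alternative: the Unicode "dash punctuation" general category,
    // which covers the hyphen-minus and the common Unicode dashes.
    static final Pattern ANY_DASH = Pattern.compile("\\s\\p{Pd}\\s");

    public static void main(String[] args) {
        // Sample title from the question, with an en dash as separator.
        String sectionTitle = "Study Summary (1 of 10) \u2013 Competition";

        // prints: false
        System.out.println(ORIGINAL.matcher(sectionTitle).find());
        // prints: [Study Summary (1 of 10), Competition]
        System.out.println(Arrays.toString(HEX_DASHES.split(sectionTitle)));
        // prints: [Study Summary (1 of 10), Competition]
        System.out.println(Arrays.toString(ANY_DASH.split(sectionTitle)));
    }
}
```

The category-based pattern is shorter and also picks up dashes not listed explicitly, at the cost of matching some dash-like characters that may not be wanted as separators.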
