I'm trying to identify the same emojis in a String for extraction that Twemoji would, using Java. A straight-up port isn't working for a great many emojis. I think I've identified the issue, so I'll illustrate it with an example below:
Suppose we have the emoji 🪔 (U+1FA94, code units \ud83e\ude94). In JavaScript regex, this is captured by \ud83e[\ude94-\ude99], which first matches the \ud83e and then finds the subsequent \ude94 within the range indicated inside the brackets. The same expression in Java regex, however, fails to match at all. If I modify the Java pattern to [\ud83e[\ude94-\ude99]], then according to an online engine the 2nd half is captured, but not the 1st.
My working theory is that Java encounters the brackets and treats everything inside as a single codepoint, and when that is combined with the code unit outside, it thinks it's looking for two codepoints instead of one. Is there an easy way to fix this, or to change the regex pattern to work around it? The obvious fix would be to use something like [\ud83e\ude94-\ud83e\ude99], but the actual regex pattern is quite lengthy. I wonder if there might be an easy encoding fix somewhere here as well.
Toy sample below:
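import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmojiTest {
    public static void main(String[] args) {
        // The pattern holds a lone high surrogate followed by a class of low surrogates
        String emojiPattern = "\ud83e[\ude94-\ude99]";
        String raw = "\ud83e\ude94";
        Pattern pattern = Pattern.compile(emojiPattern);
        Matcher matcher = pattern.matcher(raw);
        System.out.println(matcher.matches()); // false, per the behavior described above
    }
}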
Comments (1)
If you're trying to match a single specific codepoint, don't mess with surrogate pairs; refer to it by number or by name. If you want to match any codepoint in the block U+1FA94 is in, use the name of the block in a property atom.
If you switch out any of these three regular expressions your example program will print 'true'.
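As a minimal sketch of the by-number form, assuming it refers to the \x{h...h} code point escape of java.util.regex.Pattern (the class and helper names here are hypothetical). The by-name form (\N{...}, available since Java 9) and the block property depend on which Unicode version the JDK supports, so this sketch leaves them out.

```java
import java.util.regex.Pattern;

public class CodePointPattern {
    // Hypothetical helper: matches the whole input against a pattern
    // that names U+1FA94 by its code point number.
    static boolean matchesDiyaLamp(String input) {
        // The backslash is doubled so that the regex engine, not the
        // Java compiler, interprets the x{...} escape.
        return Pattern.compile("\\x{1FA94}").matcher(input).matches();
    }

    public static void main(String[] args) {
        // U+1FA94 written as its UTF-16 surrogate pair, as in the toy sample
        String raw = "\ud83e\ude94";
        System.out.println(matchesDiyaLamp(raw)); // true
    }
}
```

Because the pattern names the whole code point, the engine never has to line up a lone surrogate in the pattern against half of a pair in the input.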
The problem you're running into is that a UTF-16 surrogate pair is a single codepoint, and the RE engine matches codepoints, not code units; you can't match just the low or high half. For example, just the pattern "\ud83e" will fail to match too (when used with Matcher#find instead of Matcher#matches, of course). It's all or none. To do the kind of ranged matching you want, you have to turn away from regular expressions and look at the code units directly.
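One way to look at the code units directly can be sketched as below. This is an illustration under stated assumptions, not the answer's own code: the class and the findPair helper are hypothetical, and the range 0xDE94..0xDE99 is taken from the question's pattern.

```java
public class SurrogateScan {
    // Hypothetical helper: returns the index of the first high surrogate
    // 0xD83E that is followed by a low surrogate in 0xDE94..0xDE99,
    // or -1 if no such pair of code units exists.
    static int findPair(String s) {
        for (int i = 0; i + 1 < s.length(); i++) {
            char hi = s.charAt(i);      // charAt returns UTF-16 code units
            char lo = s.charAt(i + 1);
            if (hi == 0xD83E && lo >= 0xDE94 && lo <= 0xDE99) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String raw = "abc" + "\ud83e\ude94";
        System.out.println(findPair(raw)); // 3
    }
}
```

String#charAt works on code units rather than codepoints, which is what makes the half-by-half comparison possible here.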