Porting the Twemoji regex to extract Unicode emojis in Java

Posted on 2025-01-30 17:37:39

I'm trying to identify the same emojis in a String, for extraction, that Twemoji would, using Java. A straight-up port isn't working for a great many emojis. I think I've identified the issue, so I'll show it with an example below:

Suppose we have the emoji 🪔 (code units \ud83e\ude94). In a JavaScript regex, this is captured by \ud83e[\ude94-\ude99], which first matches \ud83e and then finds the following \ude94 within the range given inside the brackets. The same expression in a Java regex, however, fails to match at all. If I modify the Java pattern to [\ud83e[\ude94-\ude99]], then according to an online engine the second half is captured, but not the first.

My working theory is that Java encounters the brackets and treats everything inside as a single codepoint, and when that is combined with the code unit outside, it thinks it's looking for two codepoints instead of one. Is there an easy way to fix this, or a way to change the regex pattern to work around it? The obvious fix would be to use something like [\ud83e\ude94-\ud83e\ude99], but the actual regex pattern is quite lengthy. I wonder if there might be an easy encoding fix somewhere here as well.

Toy sample below:

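import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmojiMatch {
    public static void main(String[] args) {
        // Pattern ported as-is from JavaScript: high surrogate, then a class of low surrogates
        String emojiPattern = "\ud83e[\ude94-\ude99]";
        String raw = "\ud83e\ude94";
        Pattern pattern = Pattern.compile(emojiPattern);
        Matcher matcher = pattern.matcher(raw);
        System.out.println(matcher.matches());
    }
}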

Comments (1)

叹倦 2025-02-06 17:37:39

If you're trying to match a single specific codepoint, don't mess with surrogate pairs; refer to it by number:

String emojiPattern = "\\x{1FA94}";

or by name:

String emojiPattern = "\\N{DIYA LAMP}";

If you want to match any codepoint in the block that U+1FA94 belongs to, use the name of the block in a property atom:

String emojiPattern = "\\p{blk=Symbols and Pictographs Extended-A}";

If you swap in any of these three regular expressions, your example program will print 'true'.
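A minimal sketch for trying the three patterns above against the toy input. The class name CodePointPatterns and the helper matchesWhole are made up for illustration. As far as I know, \N{...} needs Java 9 or later, and whether a given character name or block name is recognized depends on the Unicode version your JDK ships, so the sketch catches PatternSyntaxException instead of assuming every pattern compiles.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class CodePointPatterns {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String raw = "\uD83E\uDE94"; // U+1FA94 written as a surrogate pair
        String[] patterns = {
            "\\x{1FA94}",                                  // by code point number
            "\\N{DIYA LAMP}",                              // by character name
            "\\p{blk=Symbols and Pictographs Extended-A}"  // by block name
        };
        for (String regex : patterns) {
            try {
                System.out.println(regex + " -> " + matchesWhole(regex, raw));
            } catch (PatternSyntaxException e) {
                // Older JDKs may not know the escape, the name, or the block.
                System.out.println(regex + " -> not supported by this JDK");
            }
        }
    }
}
```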

The problem you're running into is that a UTF-16 surrogate pair represents a single codepoint, and the RE engine matches codepoints, not code units. You can't match just the low or high half: for example, the pattern "\ud83e" on its own will also fail to match (when used with Matcher#find instead of Matcher#matches, of course). It's all or none.
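To make the code unit versus codepoint distinction concrete, here is a small sketch using only standard String methods. The class and method names are made up for illustration.

```java
public class CodeUnitsVsCodePoints {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnitCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a valid surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String raw = "\uD83E\uDE94";
        System.out.println(codeUnitCount(raw));   // 2: high and low surrogate
        System.out.println(codePointCount(raw));  // 1: the pair is one code point
        System.out.println(Integer.toHexString(raw.codePointAt(0))); // 1fa94
    }
}
```

The regex engine sees this input the way codePointAt does, as the single value U+1FA94, which is why a pattern written in terms of the two halves has nothing to match against.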

To do the kind of ranged matching on individual code units that you want, you have to turn away from regular expressions and look at the code units directly. Something like

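// Scan adjacent UTF-16 code units for the high surrogate 0xD83E
// followed by a low surrogate in the range 0xDE94..0xDE99.
char[] codeUnits = raw.toCharArray();
for (int i = 0; i < codeUnits.length - 1; i++) {
    if (codeUnits[i] == 0xD83E &&
        (codeUnits[i + 1] >= 0xDE94 && codeUnits[i + 1] <= 0xDE99)) {
        System.out.println("match");
    }
}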
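If what you need is a range of whole codepoints rather than raw code units, a character class written in terms of codepoints may be enough. Below is a sketch under that assumption, not a full Twemoji port. The helper toCodePointClass is hypothetical: it takes one JavaScript-style fragment (a fixed high surrogate plus a low-surrogate range) and rewrites it as a \x{...} range. This works because, for a fixed high surrogate, a contiguous range of low surrogates maps to a contiguous range of codepoints.

```java
import java.util.regex.Pattern;

public class SurrogateRange {
    // Hypothetical helper: convert "high[lowStart-lowEnd]" into a
    // code point character class such as [\x{1FA94}-\x{1FA99}].
    static String toCodePointClass(char high, char lowStart, char lowEnd) {
        int first = Character.toCodePoint(high, lowStart);
        int last = Character.toCodePoint(high, lowEnd);
        return String.format("[\\x{%X}-\\x{%X}]", first, last);
    }

    public static void main(String[] args) {
        String cls = toCodePointClass('\uD83E', '\uDE94', '\uDE99');
        System.out.println(cls);
        Pattern p = Pattern.compile(cls);
        System.out.println(p.matcher("\uD83E\uDE94").matches()); // inside the range
        System.out.println(p.matcher("\uD83E\uDE9A").matches()); // just past the range
    }
}
```

A real port would have to apply this kind of rewrite to every surrogate fragment in the long pattern, which this sketch does not attempt.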