Removing non-letter characters from a string leaves a UTF-16 high surrogate

Posted on 2025-01-11 10:53:44

I am using the regex [^\p{L}] (written "[^\\p{L}]" as a Java string literal) and java.util.regex.Matcher#replaceAll(String) to match and remove all non-letter characters from a string. I noticed that for characters encoded as UTF-16 surrogate pairs, replaceAll() creates a structurally invalid string (OpenJDK Runtime Environment, build 11.0.6+10-post-Ubuntu-1ubuntu118.04.1).

First a working example:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Main {  
  public static void main(String args[]) { 
    Pattern p = Pattern.compile("[^\\p{L}]");
    System.out.println(p.matcher("abcဍ*").replaceAll(""));
  } 
}

The above program prints abcဍ as expected (ဍ is MYANMAR LETTER DDA).

Now let's test the character "𝔍" (U+1D50D, \uD835\uDD0D in UTF-16, MATHEMATICAL FRAKTUR CAPITAL J, Category: Letter, Uppercase [Lu]), which is encoded as a surrogate pair:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Main {  
  public static void main(String args[]) {
    String original = "\uD835\uDD0D"; // 𝔍 (U+1D50D)
    
    Pattern p1 = Pattern.compile("[\\p{L}]");   // try regex without negation first
    Matcher m1 = p1.matcher(original);
    String r1 = m1.replaceAll("");
    System.out.println("r1: " + r1);
  
    Pattern p2 = Pattern.compile("[^\\p{L}]");  // now try regex with negation
    Matcher m2 = p2.matcher(original);
    String r2 = m2.replaceAll("");
    System.out.println("r2: " + r2);
    System.out.println("r2 length: " + r2.length());
    System.out.println("r2 char(0): " + (int) r2.charAt(0));

    System.out.println("original: " + original);
  } 
}

Output:

r1:                     // r1 = empty string as expected
r2: ?                   // r2 = broken string
r2 length: 1            
r2 char(0): 55349       // 0xD835 (high surrogate)
original: 𝔍
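
To make the output above concrete, here is a minimal sketch that inspects the code units directly using only standard java.lang.Character methods. The class name SurrogateCheck and the helper describe are hypothetical names chosen for illustration; they are not part of the original program.

```java
class SurrogateCheck {
    // Classifies a single UTF-16 code unit.
    static String describe(char c) {
        if (Character.isHighSurrogate(c)) return "high surrogate";
        if (Character.isLowSurrogate(c)) return "low surrogate";
        return "ordinary char";
    }

    public static void main(String[] args) {
        String original = "\uD835\uDD0D"; // one code point, two chars

        System.out.println(original.length());                             // 2
        System.out.println(original.codePointCount(0, original.length())); // 1
        System.out.println(Integer.toHexString(original.codePointAt(0)));  // 1d50d
        System.out.println(Character.isLetter(original.codePointAt(0)));   // true

        // 55349 is the value printed for r2.charAt(0) in the output above.
        System.out.println(describe((char) 55349));                        // high surrogate
        System.out.println(describe(original.charAt(1)));                  // low surrogate
    }
}
```

A string consisting of only the first of these two chars, as r2 does, is an unpaired high surrogate, which is why it prints as "?".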

Other examples I tested that produce structurally invalid strings:

U+104D8 (\uD801\uDCD8), OSAGE SMALL LETTER A, Category: Letter, Lowercase [Ll]
U+11ACA (\uD806\uDECA), PAU CIN HAU LETTER KHA, Category: Letter, Other [Lo]

Is my regex broken, or is this a bug in the Java Class Library? If this is a bug, is there a workaround?

习惯成性 2025-01-18 10:53:44

In your current Java version, the problem is that \p{L} is a placeholder, a mnemonic for a more complex pattern that includes alternatives with multi-character sequences.

That means the \p{L} pattern cannot be used inside a character class, as character classes are not meant to match char sequences, only individual chars.

Thus, you need to make sure you use it outside of a character class. Here, you can simply use the reverse Unicode category class, \P{L}, to match any char but a Unicode letter. The following:

Pattern p2 = Pattern.compile("\\P{L}");

returns what you need.

In case you cannot replace it with the reverse class, you can use an alternation. For example, if you have [^\d\s\p{L}], you can use (?:\P{L}|[^\d\s]).

Note that your code appears to work in Java 17+ (confirmed by user16320675).
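
If regex behaviour across Java versions is a concern, one alternative that avoids java.util.regex entirely is to filter the string by code point. The sketch below is not from the answer above; the class name LetterFilter and the method keepLetters are hypothetical. It relies on Character.isLetter(int), which accepts the letter categories Lu, Ll, Lt, Lm and Lo, and on String.codePoints(), which yields each surrogate pair as a single code point.

```java
class LetterFilter {
    // Returns a copy of input containing only code points that are letters.
    // Surrogate pairs are read and written as whole code points, so the
    // result never contains half of a pair taken from a valid pair.
    static String keepLetters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
             .filter(Character::isLetter)
             .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // BMP letter followed by a non-letter: the asterisk is dropped.
        System.out.println(keepLetters("abc\u100D*").length());     // 4

        // Supplementary letter: both chars of the pair are kept.
        System.out.println(keepLetters("\uD835\uDD0D").length());   // 2

        // Supplementary letter followed by a digit: only the digit is dropped.
        System.out.println(keepLetters("\uD801\uDCD81").length());  // 2
    }
}
```

This trades the flexibility of a pattern for predictable handling of supplementary characters; it only covers the "keep letters" case from the question.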

娇俏 2025-01-18 10:53:44

For cases like this I wrote my own utility that converts any String to a Unicode escape sequence and vice versa. Here is a sample:

String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);

The output of this code is:

\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World

The library can be found at Maven Central or on GitHub. It comes as a Maven artifact with sources and Javadoc.

Here is the Javadoc for the class StringUnicodeEncoderDecoder. This helps me a lot when I need to diagnose problems like yours. In extreme scenarios you can also convert a String into Unicode escapes, modify it in that form, and convert it back. It might be a workaround for you until you find a better solution.
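
For readers who only need the diagnostic step, the same kind of output can be produced with the plain JDK. The sketch below is an assumption-laden illustration, not the library's implementation: the class UnicodeEscapeSketch and its methods toEscapes and fromEscapes are hypothetical names, and fromEscapes assumes its input consists solely of six-character escapes as produced by toEscapes.

```java
class UnicodeEscapeSketch {
    // Writes every UTF-16 code unit as a backslash, the letter u,
    // and four lowercase hex digits.
    static String toEscapes(String s) {
        StringBuilder sb = new StringBuilder(s.length() * 6);
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("\\u%04x", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    // Reverses toEscapes. Assumes well-formed input made only of
    // six-character escapes; no validation is attempted.
    static String fromEscapes(String escaped) {
        StringBuilder sb = new StringBuilder(escaped.length() / 6);
        for (int i = 0; i + 6 <= escaped.length(); i += 6) {
            String hex = escaped.substring(i + 2, i + 6);
            sb.append((char) Integer.parseInt(hex, 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = toEscapes("Hello World");
        System.out.println(encoded);
        System.out.println(fromEscapes(encoded)); // Hello World

        // A lone high surrogate, like r2 in the question, becomes visible.
        System.out.println(toEscapes("\uD835").length()); // 6
    }
}
```

Because it works on individual chars rather than code points, an unpaired surrogate shows up as a single escape, which makes a broken string easy to spot.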
