"org.apache.commons.lang.StringEscapeUtils" and "dashes"

Posted on 2024-10-17 16:23:55

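I am using "org.apache.commons.lang.StringEscapeUtils.unescapeHtml(myHtmlString)" to convert HTML entity escapes into a string containing the actual Unicode characters that correspond to those escapes. However, it does not resolve the "em dash" and "en dash" symbols correctly. StringEscapeUtils replaces "–" with "\u0096", while the correct Unicode position is "\u2013". And as I have read, "\u0096" is the cp1252 equivalent for "–". So how can I make it work the right way? I know I could do the replacement manually, but I would like to know whether it can be done with StringEscapeUtils or any other utility.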


勿忘初心 2024-10-24 16:23:55

And as I have read "\u0096" is cp1252 equivalent for "–".

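Not quite. In Unicode, U+0096 is a C1 control code, not a dash:

http://en.wikipedia.org/wiki/C0_and_C1_control_codes

It is the byte 0x96 in cp1252 that encodes the en dash, so "\u0096" in a Java String is not a valid replacement for "–"; the character you want is "\u2013".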

Well, if StringEscapeUtils really does get this wrong (the en dash should indeed be \u2013), if it is the only escape it gets wrong, and if there is no reason for any other 0x0096 to appear in your String, then a replaceAll after calling StringEscapeUtils should work.

The following does the replace you expect:

System.out.println("Broken\u0096stuff".replaceAll("\u0096", "\u2013"));

However, you should first make sure that StringEscapeUtils really is what messes things up, and really understand why and how that 0x0096 ends up in a Java String.
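One way to check where the 0x0096 comes from is to dump the code points of the unescaped string and, if stray C1 controls really are there, to reinterpret them as cp1252 bytes. The sketch below is only an illustration: the names `C1Repair`, `dumpCodePoints` and `repairC1` are made up here, and it assumes that the JDK provides the `windows-1252` charset and that every character in the range U+0080 to U+009F is a mis-decoded cp1252 byte rather than an intentional control code.

```java
import java.nio.charset.Charset;

final class C1Repair {
    // Assumption: the running JDK ships the windows-1252 charset.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Lists every code point as U+XXXX so stray controls become visible. */
    static String dumpCodePoints(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> out.append(String.format("U+%04X ", cp)));
        return out.toString().trim();
    }

    /** Reinterprets C1 controls (U+0080..U+009F) as cp1252 bytes. */
    static String repairC1(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                String decoded = new String(new byte[] {(byte) c}, CP1252);
                // Bytes that cp1252 leaves undefined decode to U+FFFD; keep the original then.
                out.append("\uFFFD".equals(decoded) ? String.valueOf(c) : decoded);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String broken = "Broken\u0096stuff";
        System.out.println(dumpCodePoints(broken));
        System.out.println(dumpCodePoints(repairC1(broken)));
    }
}
```

Unlike a single hard-coded replacement, this also covers the em dash (0x97 in cp1252) and the other cp1252 punctuation, at the cost of assuming that no C1 control in the input is intentional.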

It should probably also be pointed out that, sadly, Java's Unicode support is a major SNAFU, because Java was conceived before Unicode 3.1 came out.

Hence it seemed a smart idea to use 16 bits for the char primitive, to use a four-hex-digit '\uxxxx' escape sequence, to have String's length() method return the length of the underlying char[], and so on.

These all turned out to be very, very stupid ideas, leading to one of the major Java SNAFUs: the char primitive can no longer hold every Unicode character, and String's length() method returns the number of chars (UTF-16 code units) rather than the String's real length in characters.

I like the following:

final char brokenCharCannotRepresentUnicode31Codepoints = '\uFFFF'; // How do I store a Unicode 3.1 codepoint here!?
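To make the complaint concrete, here is a small sketch that compares length() with the code point count. The class name `CodePointDemo`, the helper `realLength` and the choice of U+1D11E as the sample code point are illustrative assumptions, not anything from the question.

```java
final class CodePointDemo {
    /** Number of Unicode code points in s, as opposed to UTF-16 code units. */
    static int realLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        int gClef = 0x1D11E; // arbitrary code point above U+FFFF, chosen for illustration
        String s = new String(Character.toChars(gClef));
        System.out.println(s.length());                 // counts chars (UTF-16 code units)
        System.out.println(realLength(s));              // counts code points
        System.out.println(Character.charCount(gClef)); // chars needed for this code point
        System.out.println(s.codePointAt(0) == gClef);  // the pair reads back as one code point
    }
}
```

A code point like this has to be kept in an int or a String; a single char can only hold one half of the surrogate pair.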

Why this rant? Well, because I don't know how the regexp replacement in String's replaceAll is implemented, but I really wouldn't be surprised if there were cases (i.e. certain code points) where String's replaceAll was, like char and like length and like \uxxxx, well... hmmm, totally broken.
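For this particular fix the regex machinery can be sidestepped altogether, since String.replace(char, char) does a literal replacement. The sketch below, with the hypothetical names `DashReplaceDemo` and `fixDash` and a made-up sample string, compares the literal replacement with replaceAll on text that contains a code point above U+FFFF.

```java
final class DashReplaceDemo {
    /** Literal, regex-free replacement of U+0096 with the en dash U+2013. */
    static String fixDash(String s) {
        return s.replace('\u0096', '\u2013');
    }

    public static void main(String[] args) {
        // Sample text: a supplementary code point (U+1D11E) next to the stray control.
        String sample = new String(Character.toChars(0x1D11E)) + "\u0096x";
        String viaReplace = fixDash(sample);
        String viaRegex = sample.replaceAll("\u0096", "\u2013");
        System.out.println(viaReplace.equals(viaRegex));          // do both approaches agree?
        System.out.println(viaReplace.codePointAt(0) == 0x1D11E); // is the surrogate pair intact?
    }
}
```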
