"org.apache.commons.lang.StringEscapeUtils" and dashes

Posted on 2024-10-17 16:23:55

I am using "org.apache.commons.lang.StringEscapeUtils.unescapeHtml(myHtmlString)" to convert HTML entity escapes into a string containing the actual Unicode characters that the escapes stand for. However, it doesn't handle the "em dash" and "en dash" symbols properly. StringEscapeUtils replaces the escape for "–" with "\u0096", while the correct replacement is "\u2013". As I have read, "\u0096" is the cp1252 equivalent of "–". So how can I make it work the right way? I know that I can replace it manually, but I wonder whether I can do it with StringEscapeUtils or with some other utility.

Comments (2)

勿忘初心 2024-10-24 16:23:55
And as I have read "\u0096" is cp1252 equivalent for "–".

I don't think so. 0x0096 in Unicode is a C1 control code:

http://en.wikipedia.org/wiki/C0_and_C1_control_codes

and is unlikely to be the replacement for "-" (as you wrote).

Well, if StringEscapeUtils really messes this up (en dash should indeed be \u2013), if it's the only escape it is messing up, and if there's no reason to have any other 0x0096 in your String, then a replaceAll after calling StringEscapeUtils should work.

The following does the replace you expect:

System.out.println("Broken\u0096stuff".replaceAll("\u0096", "\u2013"));

However you should first make sure that StringEscapeUtils really messes things up and really, really, understand why/how you get that 0x0096 in a Java String.
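If the stray character does turn out to come from cp1252 byte values landing in the C1 control range, a slightly more general repair than a single replaceAll could look like the sketch below. It is only an illustration: the class and method names (C1Repair, repairC1) are made up for this example, and it assumes the windows-1252 charset is available in your JRE and that no genuine C1 control characters are expected in the text.

```java
import java.nio.charset.Charset;

final class C1Repair {
    // Assumption: the JRE ships the windows-1252 charset (mainstream JDKs do).
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /** Re-reads every char in U+0080..U+009F as the cp1252 byte with the same value. */
    static String repairC1(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x80 && c <= 0x9F) {
                String decoded = new String(new byte[] {(byte) c}, CP1252);
                // A few byte values are undefined in cp1252 and may decode to U+FFFD;
                // in that case keep the original char rather than lose information.
                out.append(decoded.equals("\uFFFD") ? String.valueOf(c) : decoded);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = repairC1("Broken\u0096stuff");
        System.out.println(fixed.equals("Broken\u2013stuff"));
    }
}
```

You would call repairC1 on the result of unescapeHtml, which also covers the em dash (0x97) and the curly quotes without listing each one by hand.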

Then, also, it should probably be pointed out to you that sadly Java's Unicode support is a major SNAFU because Java was conceived before Unicode 3.1 came out.

Hence it seemed a smart idea to use 16 bits for the char primitive, it seemed a smart idea to use a 4-hexdigits '\uxxxx' escape sequence, it seemed a smart idea to represent the length of the char[] in String's length() method, etc.

These were actually all very, very stupid ideas, leading to one of the major Java SNAFUs: the char primitive can no longer hold every Unicode character (code points above U+FFFF need two chars), and String's length method does not actually return a String's real length in characters.

I like the following:

final char brokenCharCannotRepresentUnicode31Codepoints = '\uFFFF'; // How do I store a Unicode 3.1 codepoint here!?
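For what it's worth, since Java 5 a code point above U+FFFF can be held in an int and stored in a String as a surrogate pair. The following is a small sketch of that (the class name CodePointDemo and the helper codePoints are invented for the example); it shows the difference between length() and the number of code points.

```java
final class CodePointDemo {
    /** Number of Unicode code points in s, as opposed to s.length() (UTF-16 units). */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        int grinningFace = 0x1F600; // a code point above U+FFFF, so it needs an int
        String s = new String(Character.toChars(grinningFace));
        System.out.println(s.length());      // counts UTF-16 code units
        System.out.println(codePoints(s));   // counts code points
        System.out.println(Integer.toHexString(s.codePointAt(0)));
    }
}
```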

Why this rant? Well, because I don't know how the regexp replacement in String's replaceAll is implemented, but I really wouldn't be surprised if there were cases (i.e. certain codepoints) where String's replaceAll was, like char and like length and like \uxxxx, well... hmmm, totally broken.
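One way to check that worry rather than guess is a quick experiment like the one below. It is only a sketch with made-up names (ReplaceAllCheck, maskAll, swap); as far as I know, java.util.regex treats a surrogate pair as a single character, so the "." pattern should consume the pair as one unit.

```java
final class ReplaceAllCheck {
    /** Replaces every character matched by the regex "." with 'x'. */
    static String maskAll(String s) {
        return s.replaceAll(".", "x");
    }

    /** Replaces a literal supplementary character (U+1F600) with an en dash. */
    static String swap(String s) {
        return s.replaceAll("\uD83D\uDE00", "\u2013");
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', one supplementary code point, 'b'
        System.out.println(maskAll(s).length());
        System.out.println(swap(s).equals("a\u2013b"));
    }
}
```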

梦里的微风 2024-10-24 16:23:55

I suspect that the problem is not in the StringEscapeUtils.unescapeHtml(...) call.

Instead, I suspect that the character has been turned into '\u0096' before the call. More specifically, I suspect that your code has used the wrong character set when reading the HTML as characters.

As you say, an en dash is code point 0x96 in cp1252. So one way to get an en dash mistranslated to the Unicode code point \u0096 would be to start with a byte stream that was encoded using cp1252 and read / decode it using an InputStreamReader(is, "ISO-8859-1"), i.e. as Latin-1.
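The effect is easy to reproduce in isolation. The sketch below (the class name DecodeDemo and the helper firstChar are hypothetical, and it assumes the windows-1252 charset is present in the JRE) decodes the single byte 0x96 once as ISO-8859-1 and once as windows-1252 and prints the resulting code point each time.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodeDemo {
    /** Decodes the given bytes with the given charset and returns the first char read. */
    static int firstChar(byte[] bytes, Charset cs) {
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            return r.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] cp1252EnDash = {(byte) 0x96};
        // Wrong charset: the byte value is carried over unchanged into the C1 range.
        System.out.printf("%04X%n", firstChar(cp1252EnDash, StandardCharsets.ISO_8859_1));
        // Matching charset: the byte is mapped to the real en dash.
        System.out.printf("%04X%n", firstChar(cp1252EnDash, Charset.forName("windows-1252")));
    }
}
```

If this is what happens in your code, the fix belongs where the HTML is read (pass the charset the document is really encoded in), not after unescapeHtml.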
