When escaping strings with HTML entities, can I safely skip encoding characters above Unicode 127 if I use UTF-8?
When outputting a string in HTML, one must escape special characters as HTML entities ("&<>" etc.) for understandable reasons.
I've examined two Java implementations of this:
org.apache.commons.lang.StringEscapeUtils.escapeHtml(String)
net.htmlparser.jericho.CharacterReference.encode(CharSequence)
Both escape all characters above Unicode code point 127 (0x7F), which is effectively all non-English characters.
This behavior is fine, but the strings it produces are not human-readable when the text is non-English (for example, Hebrew or Arabic). I've seen that when characters above Unicode 127 aren't escaped like this, they still render correctly in browsers. I believe this is because the HTML page is UTF-8 encoded, so the browser can interpret these characters directly.
My question: Can I safely disable escaping Unicode characters above code point 127 when escaping HTML entities, provided my web page is UTF-8 encoded?
Comments (2)
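You only need to use HTML entities under two circumstances:
1. To escape characters that have a special meaning in HTML (such as <)
2. To display characters that do not exist in the document encoding (such as the € symbol in an ISO-8859-1 document)
Given that UTF-8 can represent all Unicode characters, only the first case applies.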
When typing HTML manually, you may find it practical to insert an HTML entity now and then if your editor or keyboard won't let you type a certain character (it's easier to just type &copy; than to figure out how to type an actual ©). But when escaping text automatically, you just make the page size grow ;-)
I know little about Java, but other languages have separate functions for encoding only the special characters and for encoding all possible entities.
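As a rough illustration of the first case in Java, the sketch below escapes only the characters that are significant in HTML markup and passes everything else, including text above U+007F, through unchanged. The class name MinimalHtmlEscaper and its escape method are hypothetical; they are not part of Commons Lang or Jericho. The output is only safe to emit when the page really is served as UTF-8.

```java
final class MinimalHtmlEscaper {

    private MinimalHtmlEscaper() {}

    /** Escapes only the characters that are significant in HTML text and attribute values. */
    static String escape(CharSequence input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                // Everything else, including characters above U+007F, is copied as-is.
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u05E9\u05DC\u05D5\u05DD is the Hebrew word "shalom", written with
        // Unicode escapes so the source file itself stays ASCII.
        System.out.println(escape("<b>\u05E9\u05DC\u05D5\u05DD & \"hello\"</b>"));
    }
}
```

The quote characters are escaped as well so the same method can be used for attribute values; if the text only ever lands in element content, escaping &, < and > would be enough.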
If you send the encoding in the Content-Type header, for example
Content-Type: text/html; charset=utf-8
then the browser will interpret your source as UTF-8 and you can send all those characters as normal UTF-8 encoded bytes.
Alternatively, you can specify the encoding in the head of your HTML page, for example
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
This has the advantage that the information is stored with the HTML page if the user saves it and re-opens it from their hard disk at a later time.
Personally I'd do both (send the right header and add the meta tag to your HTML page). It should be fine as long as the two places agree about the encoding.
Update: HTML5 has added a new syntax for specifying the encoding, for example
<meta charset="utf-8">
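A minimal sketch of keeping the header, the meta tag and the actual byte encoding in agreement from Java, using only the JDK. The class Utf8Page and its members are hypothetical names chosen for illustration; how the header value is attached to the response depends on your server API and is not shown.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Page {

    /** Value to send in the HTTP Content-Type response header. */
    static final String CONTENT_TYPE = "text/html; charset=utf-8";

    private Utf8Page() {}

    /** Wraps already-escaped markup in a page that declares UTF-8 in its head. */
    static String render(String title, String bodyHtml) {
        return "<!DOCTYPE html>\n"
             + "<html>\n<head>\n"
             + "<meta charset=\"utf-8\">\n"
             + "<title>" + title + "</title>\n"
             + "</head>\n<body>\n" + bodyHtml + "\n</body>\n</html>\n";
    }

    /** Encodes the page with the same charset that the header and meta tag announce. */
    static byte[] toBytes(String page) {
        return page.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hebrew text is sent as raw UTF-8 bytes, not as numeric entities.
        String page = render("Greeting", "<p>\u05E9\u05DC\u05D5\u05DD</p>");
        System.out.println("Content-Type: " + CONTENT_TYPE);
        System.out.println(toBytes(page).length + " bytes");
    }
}
```

The point of routing everything through one class is that the declared encoding and the encoding actually used to produce the bytes cannot drift apart.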