Character references should be used when the document creating/editing software, the data storage, or a transport channel cannot store Unicode data or cannot preserve the byte stream it is encoded to.
In practice, this means working with legacy applications, legacy configurations or legacy transport protocols, where some part of the toolchain may support only 8-bit encodings or even only ASCII. Unicode characters cannot be stored directly in such cases, so falling back to character references for everything except ASCII characters can be useful: it avoids the nasty encoding conversion problems that can appear when switching from 8-bit encodings to Unicode. Using named entities instead of character references is marginally more readable, but it unnecessarily complicates XML compatibility or a migration to XML, because using entities requires a DOCTYPE declaration or an embedded DTD. This does not apply to &lt;, &amp;, &quot;, &apos; and &gt;, which are predefined in XML.
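As a minimal sketch of falling back to character references for everything outside ASCII, Python's standard "xmlcharrefreplace" encoding error handler can do the conversion. The function name to_char_refs is made up for this illustration; other languages and tools have their own equivalents.

```python
def to_char_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    # The "xmlcharrefreplace" error handler emits &#NNNN; for any character
    # that the target encoding (ASCII here) cannot represent.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(to_char_refs("√2 and æ"))  # &#8730;2 and &#230;
```

ASCII characters pass through unchanged, so the result survives any toolchain that is at least ASCII-safe. Note that this does not escape < or &; markup-significant characters still need separate handling.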
If you are working in a modern environment, using Unicode characters directly is generally preferred: the textual data can often be used without parsing or interpretation (e.g. direct text searches), it is easier, and it will probably lead to more readable and thus more maintainable code.
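To illustrate the point about direct searches, here is a small sketch with an invented sample string: text stored with character references has to be decoded before a plain substring search finds the character.

```python
import html

# Hypothetical ASCII-only storage that uses character references.
stored = "sm&#248;rrebr&#248;d"

print("ø" in stored)                 # False: the raw text only contains "&#248;"
print("ø" in html.unescape(stored))  # True: decoding is needed before searching
```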
The characters you must encode are < and &, and also " and ' when they appear in an attribute value delimited by that same character. In theory you should also escape > when it appears as part of a ]]> string that is not meant to end a CDATA section, but this is only for SGML compatibility and therefore not generally needed. These characters should be escaped using entities rather than character references. The need to escape & also applies to URL values in <a href="...">, which unfortunately is commonly forgotten.
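A rough sketch of these escaping rules, using Python's standard html.escape. The helper name link and the sample URL are invented for the example. One caveat: html.escape writes the apostrophe as the character reference &#x27; rather than a named entity, so it only partly follows the "entities rather than character references" advice.

```python
from html import escape


def link(url: str, label: str) -> str:
    """Build an <a> element, escaping the attribute value and the text."""
    # quote=True also escapes " and ', so the value is safe whichever
    # quote character delimits the attribute.
    href = escape(url, quote=True)
    # In element content only & and < strictly need escaping;
    # escape() handles > as well.
    text = escape(label, quote=False)
    return f'<a href="{href}">{text}</a>'


print(link("search?q=a&lang=en", "R&D <beta>"))
# <a href="search?q=a&amp;lang=en">R&amp;D &lt;beta&gt;</a>
```

The & separating the query parameters becomes &amp; inside href, which is exactly the case that tends to be forgotten.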
For me, encoding characters means that the page will be more accessible, i.e. more browsers will display it correctly, etc.
I'm lazy and usually enter Unicode characters (like √, ∞, æ) as they are when I need them, and mostly it works OK.
You can run into problems if the data
1) can't be stored
2) can't be transferred
3) can't be displayed
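The first two cases can be checked ahead of time, at least roughly. The sketch below (the helper name can_store is hypothetical) tests whether a string is representable in a given encoding; it says nothing about the third case, since whether a character displays depends on the fonts available to the reader.

```python
def can_store(text: str, encoding: str) -> bool:
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


for enc in ("ascii", "latin-1", "utf-8"):
    print(enc, can_store("√, ∞, æ", enc))
```

If a check like this fails for the encoding your storage or transport uses, that is the situation where character references are worth the trouble.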