Which characters need to be escaped in HTML?
Are they the same as XML, perhaps plus the space one (&nbsp;)?
I've found some huge lists of HTML escape characters but I don't think they must be escaped. I want to know what needs to be escaped.
The exact answer depends on the context. In general, these characters must not be present (HTML 5.2 §3.2.4.2.5):

These restrictions are scattered across the specification. E.g., attribute values (§8.1.2.3) must not contain an ambiguous ampersand and must be either (i) empty, (ii) within single quotes (and thus must not contain the U+0027 APOSTROPHE character, '), (iii) within double quotes (and thus must not contain the U+0022 QUOTATION MARK character, "), or (iv) unquoted, with the following restrictions:
If your code runs in a browser, you can do this:
If you're inserting text content in your document in a location where text content is expected¹, you typically only need to escape the same characters as you would in XML. Inside of an element, this just includes the entity escape ampersand (&) and the element delimiter less-than and greater-than signs (< and >):

& becomes &amp;
< becomes &lt;
> becomes &gt;

Inside of attribute values you must also escape the quote character you're using:

" becomes &quot;
' becomes &#39;
In some cases it may be safe to skip escaping some of these characters, but I encourage you to escape all five in all cases to reduce the chance of making a mistake.
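A minimal sketch of the "escape all five" advice above, assuming plain JavaScript with no dependencies. The names `ESCAPE_MAP` and `escapeHtml` are made up for this illustration and do not come from any library. It is only meant for the text-content and quoted-attribute positions described in this answer.

```javascript
// Hypothetical helper: escapes the five characters discussed above.
// &#39; is used for the apostrophe because it is understood everywhere;
// &apos; also works in HTML5 and XML.
const ESCAPE_MAP = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

function escapeHtml(text) {
  // A single pass over the string, so an already-produced "&amp;" is never re-escaped.
  return String(text).replace(/[&<>"']/g, (ch) => ESCAPE_MAP[ch]);
}

// Usage: the same escaped value is safe both as element content
// and inside a double-quoted attribute value.
const userInput = 'Tom & "Jerry" <b>';
const html = `<p title="${escapeHtml(userInput)}">${escapeHtml(userInput)}</p>`;
console.log(html);
```

Doing the replacement in one pass matters: chaining separate replacements with the ampersand handled last would escape the ampersands introduced by the earlier steps.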
If your document encoding does not support all of the characters that you're using, such as if you're trying to use emoji in an ASCII-encoded document, you also need to escape those. Most documents these days are encoded using the fully Unicode-supporting UTF-8 encoding where this won't be necessary.
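For the ASCII-encoded case mentioned above, one possible approach is to turn every non-ASCII character into a numeric character reference. This is a rough sketch; `escapeNonAscii` is a hypothetical name, and the function deliberately leaves the markup characters alone, so it would be applied in addition to the usual escaping, not instead of it.

```javascript
// Hypothetical helper: replaces each code point above U+007F with a
// hexadecimal numeric character reference such as &#xE9;.
function escapeNonAscii(text) {
  let out = '';
  // for...of iterates by code point, so characters outside the BMP
  // (e.g. emoji) are handled as one unit rather than two surrogates.
  for (const ch of String(text)) {
    const codePoint = ch.codePointAt(0);
    out += codePoint > 0x7f
      ? `&#x${codePoint.toString(16).toUpperCase()};`
      : ch;
  }
  return out;
}

console.log(escapeNonAscii('caf\u00e9 \u{1F600}'));
```

With a UTF-8 document none of this is needed, which is why the answer treats it as an exception.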
In general, you should not escape spaces as &nbsp;. &nbsp; is not a normal space, it's a non-breaking space. You can use these instead of normal spaces to prevent a line break from being inserted between two words, or to insert extra space without it being automatically collapsed, but this is usually a rare case. Don't do this unless you have a design constraint that requires it.
¹ By "a location where text content is expected", I mean inside of an element or quoted attribute value where normal parsing rules apply. For example: <p>HERE</p> or <p title="HERE">...</p>. What I wrote above does not apply to content that has special parsing rules or meaning, such as inside of a script or style tag, or as an element or attribute name. For example: <NOT-HERE>...</NOT-HERE>, <script>NOT-HERE</script>, <style>NOT-HERE</style>, or <p NOT-HERE="...">...</p>.

In these contexts, the rules are more complicated and it's much easier to introduce a security vulnerability. I strongly discourage you from ever inserting dynamic content in any of these locations. I have seen teams of competent security-aware developers introduce vulnerabilities by assuming that they had encoded these values correctly, but missing an edge case. There's usually a safer alternative, such as putting the dynamic value in an attribute and then handling it with JavaScript.
If you must, please read the Open Web Application Security Project's XSS Prevention Rules to help understand some of the concerns you will need to keep in mind.
It depends upon the context. Some possible contexts in HTML:
See OWASP's Cross Site Scripting Prevention Cheat Sheet, especially the "Why Can't I Just HTML Entity Encode Untrusted Data?" and "XSS Prevention Rules" sections. However, it's best to read the whole document.
Basically, there are three main characters that should always be escaped in your HTML and XML files so that they don't interact with the rest of the markup. As you would probably expect, two of them are the syntax wrappers, < and >. They are listed below:

We can also write the double quote (") as &quot; and the single quote (') as &apos;.
Avoid putting dynamic content in <script> and <style>. These rules do not apply to them. For example, if you have to include JSON in a <script>, replace < with \x3c, the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialisation.

HTML Escape Characters: Complete List:
http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php
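For the JSON-in-<script> case mentioned above, the post-serialisation replacements could look roughly like this. `jsonForScript` is a hypothetical name. One deliberate deviation: the sketch emits \u003c instead of the \x3c suggested above, on the assumption that the output may also be read with JSON.parse, where \x3c is not a valid escape.

```javascript
// Hypothetical helper: serialises a value and then neutralises the
// characters that are risky inside a <script> element.
function jsonForScript(value) {
  // Note: JSON.stringify returns undefined for undefined or functions;
  // this sketch assumes a plain JSON-serialisable value.
  return JSON.stringify(value)
    .replace(/</g, '\\u003c')       // prevents "</script>" from closing the element
    .replace(/\u2028/g, '\\u2028')  // LINE SEPARATOR
    .replace(/\u2029/g, '\\u2029'); // PARAGRAPH SEPARATOR
}

const payload = { html: '</script><script>', sep: 'a\u2028b\u2029c' };
const safe = jsonForScript(payload);
console.log(safe);
```

Because only escape sequences inside JSON strings are introduced, parsing the result gives back the original value.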
So you need to escape <, or & when followed by anything that could begin a character reference. Also, the rule on ampersands is the only such rule for quoted attributes, as the matching quotation mark is the only thing that will terminate one. But if you don't want to terminate the attribute value there, escape the quotation mark.
If you want to escape a string of markup using JavaScript there is:
or, if you don't want to pull in a dependency, here is the same thing, though slightly slower because it uses split/map/join instead of charCodeAt/substring.
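The original snippet is missing from this page, so the following is only a sketch of what a dependency-free split/map/join version might look like. It is not the answer author's code, and `HTML_ESCAPES` and `escapeHtmlSplitJoin` are names invented here.

```javascript
// Hypothetical lookup table for the characters to escape.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

function escapeHtmlSplitJoin(str) {
  return String(str)
    .split('')                             // one array entry per UTF-16 code unit
    .map((ch) => HTML_ESCAPES[ch] || ch)   // swap in the entity where one is defined
    .join('');
}

console.log(escapeHtmlSplitJoin('<a href="x">Tom & Jerry\'s</a>'));
```

Splitting by code unit is fine here because all five escaped characters are ASCII; surrogate pairs pass through unchanged and are rejoined as they were.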