Which characters need to be escaped in HTML?

Posted on 2024-12-04 02:27:19

Are they the same as XML, perhaps plus the space one (&nbsp;)?

I've found some huge lists of HTML escape characters but I don't think they must be escaped. I want to know what needs to be escaped.

Comments (6)

孤千羽 2024-12-11 02:27:20

The exact answer depends on the context. In general, these characters must not be present (HTML 5.2 §3.2.4.2.5):

Text nodes and attribute values must consist of Unicode characters, must not contain U+0000 characters, must not contain permanently undefined Unicode characters (noncharacters), and must not contain control characters other than space characters. This specification includes extra constraints on the exact value of Text nodes and attribute values depending on their precise context.

For elements in HTML, the constraints of the Text content model also depend on the kind of element. For instance, a "<" inside a textarea element does not need to be escaped in HTML because textarea is an escapable raw text element.

These restrictions are scattered across the specification. E.g., attribute values (§8.1.2.3) must not contain an ambiguous ampersand and be either (i) empty, (ii) within single quotes (and thus must not contain U+0027 APOSTROPHE character '), (iii) within double quotes (must not contain U+0022 QUOTATION MARK character "), or (iv) unquoted — with the following restrictions:

... must not contain any literal space characters, any U+0022 QUOTATION MARK characters ("), U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=), U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be the empty string.
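As a rough illustration of the unquoted-attribute restrictions quoted above, here is a minimal sketch. The helper name canBeUnquotedAttributeValue is hypothetical, and the check assumes "space characters" means space, tab, LF, FF and CR. It does not test for ambiguous ampersands.

```javascript
// Hypothetical helper: true if `value` meets the restrictions listed for
// unquoted attribute values (ambiguous ampersands are NOT checked here).
function canBeUnquotedAttributeValue(value) {
  // Must be non-empty and contain none of:
  // space characters (space, tab, LF, FF, CR), " ' = < > `
  return value.length > 0 && !/[ \t\n\f\r"'=<>`]/.test(value);
}

console.log(canBeUnquotedAttributeValue('submit'));    // true
console.log(canBeUnquotedAttributeValue('two words')); // false
console.log(canBeUnquotedAttributeValue(''));          // false
```

When the check fails, the simplest fix is usually to quote the attribute and escape the matching quote character inside it.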

有深☉意 2024-12-11 02:27:20

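If your code runs in a browser, you can do this:

const escapeHTML = (() => {
  // Reuse one detached element; the browser's serializer does the escaping.
  const el = document.createElement('div');
  return text => {
    // textContent stores the string verbatim (innerText would turn newlines into <br>).
    el.textContent = text;
    // Serializing escapes &, < and > but not quotes, so use the result in element content only.
    return el.innerHTML;
  };
})();
escapeHTML('Use <script> carefully.'); // 'Use &lt;script&gt; carefully.'

淡淡的优雅 2024-12-11 02:27:19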

If you're inserting text content in your document in a location where text content is expected [1], you typically only need to escape the same characters as you would in XML. Inside of an element, this just includes the entity escape ampersand & and the element delimiter less-than and greater-than signs < >:

& becomes &amp;
< becomes &lt;
> becomes &gt;

Inside of attribute values you must also escape the quote character you're using:

" becomes &quot;
' becomes &#x27; (hex value) or &#39; (dec value)

In some cases it may be safe to skip escaping some of these characters, but I encourage you to escape all five in all cases to reduce the chance of making a mistake.

If your document encoding does not support all of the characters that you're using, such as if you're trying to use emoji in an ASCII-encoded document, you also need to escape those. Most documents these days are encoded using the fully Unicode-supporting UTF-8 encoding where this won't be necessary.
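For the ASCII-encoded case described above, a minimal sketch could replace every non-ASCII code point with a hexadecimal numeric character reference. The function name escapeNonAscii is hypothetical, and the sketch assumes the input is well-formed (no lone surrogates).

```javascript
// Hypothetical helper: rewrite every character outside ASCII as a
// hexadecimal numeric character reference such as &#xE9;.
function escapeNonAscii(text) {
  let result = '';
  // for...of iterates by code point, so characters outside the BMP
  // (emoji, for example) are handled as a single unit.
  for (const ch of text) {
    const codePoint = ch.codePointAt(0);
    result += codePoint > 0x7f
      ? `&#x${codePoint.toString(16).toUpperCase()};`
      : ch;
  }
  return result;
}

console.log(escapeNonAscii('caf\u00e9 \u{1F600}')); // caf&#xE9; &#x1F600;
```

This only addresses the encoding problem. The five markup-significant characters still need their own escaping.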

In general, you should not escape spaces as &nbsp;. &nbsp; is not a normal space, it's a non-breaking space. You can use these instead of normal spaces to prevent a line break from being inserted between two words, or to insert extra space without it being automatically collapsed, but this is usually a rare case. Don't do this unless you have a design constraint that requires it.


1 By "a location where text content is expected", I mean inside of an element or quoted attribute value where normal parsing rules apply. For example: <p>HERE</p> or <p title="HERE">...</p>. What I wrote above does not apply to content that has special parsing rules or meaning, such as inside of a script or style tag, or as an element or attribute name. For example: <NOT-HERE>...</NOT-HERE>, <script>NOT-HERE</script>, <style>NOT-HERE</style>, or <p NOT-HERE="...">...</p>.

In these contexts, the rules are more complicated and it's much easier to introduce a security vulnerability. I strongly discourage you from ever inserting dynamic content in any of these locations. I have seen teams of competent security-aware developers introduce vulnerabilities by assuming that they had encoded these values correctly, but missing an edge case. There's usually a safer alternative, such as putting the dynamic value in an attribute and then handling it with JavaScript.
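The "put the dynamic value in an attribute" alternative might look like the sketch below. The names escapeAttribute and renderWidget, the id "widget" and the data-settings attribute are all hypothetical, and the value is assumed to be JSON-serializable.

```javascript
// Character references for the five characters discussed above.
const REFERENCES = {
  '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;', "'": '&#39;'
};

// Hypothetical helper: escape text for use inside a quoted attribute value.
function escapeAttribute(text) {
  return String(text).replace(/[&<>"']/g, ch => REFERENCES[ch]);
}

// Hypothetical renderer: the dynamic data travels in a data-* attribute
// instead of being written into a <script> block.
function renderWidget(settings) {
  const json = JSON.stringify(settings);
  return `<div id="widget" data-settings="${escapeAttribute(json)}"></div>`;
}

console.log(renderWidget({ title: 'Tom & "Jerry"' }));

// In the browser, a static script could then read the value back with:
//   JSON.parse(document.getElementById('widget').dataset.settings)
```

The browser decodes the character references when it parses the attribute, so the script receives the original JSON text without any dynamic content ever appearing inside script source.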

If you must, please read the Open Web Application Security Project's XSS Prevention Rules to help understand some of the concerns you will need to keep in mind.

慕巷 2024-12-11 02:27:19

It depends upon the context. Some possible contexts in HTML:

  • document body
  • inside common attributes
  • inside script tags
  • inside style tags
  • several more!

See OWASP's Cross Site Scripting Prevention Cheat Sheet, especially the "Why Can't I Just HTML Entity Encode Untrusted Data?" and "XSS Prevention Rules" sections. However, it's best to read the whole document.

街道布景 2024-12-11 02:27:19

Basically, there are three main characters which should always be escaped in your HTML and XML files so that they don't interact with the rest of the markup. As you might expect, two of them are the tag delimiters, < and >. They are listed below:

 1)  < (&lt;)
 2)  > (&gt;)
 3)  & (&amp;)

You may also write the double quote (") as &quot; and the single quote (') as &apos;. Note that &apos; is not defined in HTML 4, so &#39; is the more portable choice.

Avoid putting dynamic content in <script> and <style>; these rules do not apply there. For example, if you have to include JSON in a <script>, replace < with \x3c (or \u003c, which also remains valid JSON), the U+2028 character with \u2028, and U+2029 with \u2029 after JSON serialisation.
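The JSON-in-script replacements described above could be sketched as follows. The function name serializeForScript is hypothetical, the value is assumed to be JSON-serializable, and the sketch uses \u003c rather than \x3c so that the output stays valid JSON as well as valid JavaScript.

```javascript
// Hypothetical helper: serialize a value so the result can be embedded in
// an inline <script> without closing the element early.
function serializeForScript(value) {
  return JSON.stringify(value)
    .replace(/</g, '\\u003c')       // prevents "</script>" and "<!--" sequences
    .replace(/\u2028/g, '\\u2028')  // LINE SEPARATOR
    .replace(/\u2029/g, '\\u2029'); // PARAGRAPH SEPARATOR
}

const payload = serializeForScript({ note: '</script><script>alert(1)' });
console.log(`<script>const data = ${payload};</script>`);
```

Because the replacements are ordinary JSON string escapes, JSON.parse on the output still yields the original value.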

HTML Escape Characters: Complete List:
http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php

So you need to escape <, and also & when it is followed by anything that could begin a character reference. The rule on ampersands is the only such rule for quoted attributes, because the matching quotation mark is the only thing that will terminate one. If you don't want to terminate the attribute value at a quotation mark, escape it.

Changing to UTF-8 means re-saving your file:

Using the character encoding UTF-8 for your page means that you can avoid the need for
most escapes and just work with characters. Note, however, that to
change the encoding of your document, it is not enough to just change
the encoding declaration at the top of the page or on the server. You
need to re-save your document in that encoding. For help understanding
how to do that with your application read Setting encoding in web
authoring applications.

Invisible or ambiguous characters:

A particularly useful role for escapes is to represent characters that
are invisible or ambiguous in presentation.

One example would be Unicode character U+200F RIGHT-TO-LEFT MARK. This
character can be used to clarify directionality in bidirectional text
(eg. when using the Arabic or Hebrew scripts). It has no graphic form,
however, so it is difficult to see where these characters are in the
text, and if they are lost or forgotten they could create unexpected
results during later editing. Using &rlm; (or its numeric character
reference equivalent &#x200F;) instead makes it very easy to spot
these characters.

An example of an ambiguous character is U+00A0 NO-BREAK SPACE. This
type of space prevents line breaking, but it looks just like any other
space when used as a character. Using &nbsp; makes it
quite clear where such spaces appear in the text.
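One way to apply this to existing source text is a small pass that swaps the two characters mentioned above for their visible references. This is only a sketch; the names VISIBLE_REFERENCES and revealInvisible are hypothetical, and only these two characters are covered.

```javascript
// The two characters named above, mapped to visible character references.
const VISIBLE_REFERENCES = {
  '\u200F': '&rlm;',  // RIGHT-TO-LEFT MARK
  '\u00A0': '&nbsp;'  // NO-BREAK SPACE
};

// Hypothetical helper: make invisible or ambiguous characters visible in source.
function revealInvisible(source) {
  return source.replace(/[\u200F\u00A0]/g, ch => VISIBLE_REFERENCES[ch]);
}

console.log(revealInvisible('10\u00A0km')); // 10&nbsp;km
```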

南巷近海 2024-12-11 02:27:19

If you want to escape a string of markup using JavaScript there is:

or, if you don't want to pull in a dependency, here is the same thing, though slightly slower because it uses split/map/join instead of charCodeAt/substring.

function escapeMarkup (dangerousInput) {
  const dangerousString = String(dangerousInput);
  // Fast path: return early when there is nothing to escape.
  const matchHtmlRegExp = /["'&<>]/;
  const match = matchHtmlRegExp.exec(dangerousString);
  if (!match) {
    // Return the string form so the result type is always a string.
    return dangerousString;
  }

  // Each markup-significant character and its character reference.
  const encodedSymbolMap = {
    '"': '&quot;',
    '\'': '&#39;',
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;'
  };
  const dangerousCharacters = dangerousString.split('');
  const safeCharacters = dangerousCharacters.map(function (character) {
    return encodedSymbolMap[character] || character;
  });
  const safeString = safeCharacters.join('');
  return safeString;
}