What is the difference between UTF-8 and HTML entities?
Comments (5)
Think of UTF-8 as a lossless, self-synchronizing way to map a list of natural numbers to a byte stream. Lossless means you can always get the natural numbers back; self-synchronizing means that landing in the middle of the stream is not a big problem, because you can find where the next encoded number starts.
Each natural number just happens to represent a "character".
HTML entities are another way to represent those same natural numbers. For example, &#127; stands for the natural number 127, which in Unicode is the DEL character. In UTF-8 that is the byte stream:
0111 1111
Once you go above 127 the encoding takes more than one octet, so 128 becomes:
1100 0010 1000 0000
Two DEL characters in a row become:
0111 1111 0111 1111
UTF-8 is designed so that the original list of Unicode scalar values can always be retrieved from the byte stream, even though a byte stream of, for instance, 4 octets can map back to anywhere between 1 and 4 such scalar values. This is why UTF-8 is called a "variable-length" encoding.
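As a rough sketch of the mapping described in this answer, the Python snippet below prints the UTF-8 bit patterns for code points 127 and 128 and shows how a reader that lands in the middle of a character can skip forward to the next boundary. The helper names utf8_bits, numeric_entity, and resync are made up for this illustration; only chr and str.encode come from the standard library.

```python
def utf8_bits(code_point: int) -> str:
    """Return the UTF-8 encoding of one code point as space-separated octets in binary."""
    return " ".join(f"{byte:08b}" for byte in chr(code_point).encode("utf-8"))


def numeric_entity(code_point: int) -> str:
    """Return the decimal HTML numeric character reference for a code point."""
    return f"&#{code_point};"


def resync(stream: bytes, offset: int) -> int:
    """Skip continuation bytes (bit pattern 10xxxxxx) to reach the next character start."""
    while offset < len(stream) and (stream[offset] & 0b1100_0000) == 0b1000_0000:
        offset += 1
    return offset


print(numeric_entity(127), "->", utf8_bits(127))  # &#127; -> 01111111
print(numeric_entity(128), "->", utf8_bits(128))  # &#128; -> 11000010 10000000

# "a" takes 1 byte, "é" takes 2, "€" takes 3: six bytes, three scalar values.
data = "aé€".encode("utf-8")
start = resync(data, 2)              # offset 2 is the second byte of "é"
print(data[start:].decode("utf-8"))  # €
```

The resync loop relies only on the fact that every continuation byte starts with the bits 10, which is what makes the encoding self-synchronizing.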
UTF-8 is an encoding scheme that works at the byte level.
HTML entities provide a way to express many characters using only a standard (usually ASCII) character set. They also keep those characters human-readable when UTF-8 is not available.
The main purpose of HTML entities today is to make sure that text which looks like HTML renders as text. For example, the less-than and greater-than signs (< and >), when placed in a certain order (e.g. <text>), can accidentally be interpreted as HTML when the intent was to display them as text.
The "A" you see here on screen is not actually stored as "A" in the computer; it is stored as a sequence of 1s and 0s. A character set or encoding specifies how characters are mapped to such sequences. The ASCII character set can encode only 128 characters, almost exclusively those needed for the English language. For historical reasons and because of the technical limitations of the time, it was nevertheless the character set of the very early internet.
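A minimal illustration of the point that characters are stored as bits and that ASCII covers only a limited range. The helper name bits is an assumption made for this example.

```python
def bits(text: str, encoding: str) -> str:
    """Encode text and return the resulting bytes as groups of eight bits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


print(bits("A", "ascii"))  # 01000001

try:
    bits("ä", "ascii")     # "ä" has no ASCII representation
except UnicodeEncodeError as err:
    print("not representable in ASCII:", err.reason)

print(bits("ä", "utf-8"))  # 11000011 10100100
```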
Both UTF-8 and HTML entities can be used to represent characters that are not part of ASCII. HTML entities achieve this by giving a special meaning to certain sequences of characters, so that characters not covered by ASCII can be written using only ASCII characters. UTF-8, an encoding of Unicode, achieves the same by extending the set of characters that can be encoded directly. HTML entities are only meaningful in an environment that bothers to decode them, which is usually a browser. UTF-8 text is understood by any application that supports the encoding.
Text containing only characters covered by ASCII:
Text containing European characters not covered by ASCII:
Text containing Asian characters, most certainly not covered by ASCII:
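As a rough illustration of the three cases above, the sketch below encodes one sample string per case both as UTF-8 bytes and as ASCII text with numeric character references. The sample strings ("Hello World", "Größe", "日本語") and the helper name to_entities are assumptions chosen for this example.

```python
import html


def to_entities(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


samples = ["Hello World", "Größe", "日本語"]

for text in samples:
    utf8 = text.encode("utf-8")
    as_entities = to_entities(text)
    print(f"{text!r}: {len(utf8)} UTF-8 bytes, as entities: {as_entities}")
    # Decoding the references recovers the original text.
    assert html.unescape(as_entities) == text
```

The pure-ASCII sample is identical in both forms; the other two stay readable only in the UTF-8 version.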
The problem with UTF-8 is that the client needs to understand UTF-8. For the last decade or so this has been of no concern though, as all modern computers and browsers have no problem understanding UTF-8. UTF-8 (Unicode) can encode virtually all characters in use today on this planet (with minor exceptions). Using it you can work with text "as-is". It should absolutely be the preferred encoding to save text in.
The problem with HTML entities is that normal characters take on a special meaning. When you write &auml;, it takes on the special meaning of "ä". If you actually intend to display the literal text "&auml;", you need to double-encode the sequence as &amp;auml;.
HTML entities are also notoriously unreadable. You do not want to use them to encode "special" characters in normal text. In this capacity they are a kludge bolted onto an inadequate character set. Use Unicode instead.
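The double-encoding issue can be seen with Python's standard html module, used here only as a convenient stand-in for what a browser does when it decodes entities.

```python
import html

# A browser-style decode turns the entity into the character it names.
print(html.unescape("&auml;"))      # ä

# To display the literal six characters "&auml;", the ampersand itself
# has to be escaped first.
literal = html.escape("&auml;")
print(literal)                      # &amp;auml;

# Decoding once now yields the literal text rather than the character.
print(html.unescape(literal))       # &auml;
```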
The important use of HTML entities, independent of the character set in use, is to separate HTML markup from text. HTML, too, gives special meaning to certain character sequences. <b>text</b> is a normal sequence of characters, but it has a special meaning for HTML parsers. If you intend to display the literal text "<b>text</b>", you need to encode it as &lt;b&gt;text&lt;/b&gt;, so that the HTML parser does not mistake it for HTML tags.
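A small sketch of escaping markup so that a parser treats it as text. It uses html.escape and html.parser.HTMLParser from the Python standard library; the TagCollector class and the parse helper are hypothetical names introduced for this example.

```python
import html
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the start tags and the text content that the parser sees."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        self.text.append(data)


def parse(markup: str):
    """Return (list of start tags, concatenated text) for a piece of markup."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags, "".join(collector.text)


user_text = "<b>text</b>"
escaped = html.escape(user_text)
print(escaped)                       # &lt;b&gt;text&lt;/b&gt;

print(parse(f"<p>{user_text}</p>"))  # (['p', 'b'], 'text')
print(parse(f"<p>{escaped}</p>"))    # (['p'], '<b>text</b>')
```

In the unescaped case the parser sees a b element; in the escaped case it sees only text that happens to contain angle brackets.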
A ton. HTML entities are primarily intended for escaping HTML markup so that it can be displayed in an HTML page (do not confuse what is displayed with what is in the source). For instance, &gt; displays a >, while a literal > closes a tag. While you can produce all of Unicode with HTML entities, doing so is very inefficient and downright ugly.
UTF-8 is a multi-byte encoding for Unicode, which makes it possible to represent characters outside the classic US-ASCII code page without switching or mixing code pages. A single code point (think of it as a character, though that is not strictly correct) takes between 1 and 4 bytes; the original design allowed up to 6, but the current standard (RFC 3629) limits it to 4. UTF-8 can represent any character inside or outside the Basic Multilingual Plane (BMP), such as accented characters, East Asian characters, and Celtic tree writing (Ogham), among other scripts.
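To give a feel for the size difference, the sketch below compares the number of bytes a character needs in UTF-8 with the number needed for its decimal numeric character reference. The helper names and the choice of sample characters are assumptions for this illustration.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def entity_len(ch: str) -> int:
    """Number of ASCII bytes in the decimal numeric character reference."""
    return len(f"&#{ord(ch)};")


samples = {
    "Latin capital A": "A",
    "accented a": "ä",
    "Ogham letter (U+1681)": "\u1681",
    "outside the BMP (U+1F600)": "\U0001F600",
}

for name, ch in samples.items():
    print(f"{name}: UTF-8 {utf8_len(ch)} bytes, entity {entity_len(ch)} bytes")
```

For these samples the UTF-8 form is never longer than 4 bytes, while the entity form ranges from 5 to 9 bytes.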
UTF-8 is an encoding; htmlentities is a PHP function for making user input safe to display on a page, so that HTML tags are not added directly to the markup. See the PHP manual.