Why haven't the ASCII and ISO-8859-1 encodings become history?

Posted on 2024-09-16 14:47:04

It seems to me that if UTF-8 were the only encoding used everywhere, ever, there would be a lot fewer issues with code:

  • Don't even need to think about encoding issues.
  • No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.
  • Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.
  • You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).
  • More characters can be represented in UTF-8.
  • Other things I can't think of right now.

So why haven't the inferior encodings been nuked from space?

Comments (5)

太阳哥哥 2024-09-23 14:47:04

  • Don't even need to think about encoding issues.

True. Except for all the data that's still in the old ASCII format.

  • No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.

Incorrect. UTF-8 is variable-length: 1 to 4 bytes per character (the original design allowed up to 6).
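
A quick Python sketch (illustrative only) makes the variable width concrete:

```python
# UTF-8 uses 1 to 4 bytes per character, depending on the code point.
for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```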

  • Browsers don't need to wait for the <meta> tag specifying encoding before they can do anything. StackOverflow doesn't even have the meta tag, making browsers download the full page first, slowing page rendering.

Browsers don't generally wait for the full page, they make a guess based on the first part of the page data.
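
For illustration, here is a toy Python sketch of the pre-scan half of that process; real browsers pre-scan roughly the first kilobyte for a charset declaration and fall back to heuristic detection, which is considerably more involved. sniff_charset is a hypothetical helper, not a real browser API:

```python
import re

# Naive pre-scan: look for a charset declaration in the first chunk of
# bytes before committing to a decoder.
def sniff_charset(first_chunk: bytes):
    m = re.search(rb'<meta[^>]+charset=["\']?([\w-]+)', first_chunk, re.I)
    return m.group(1).decode("ascii") if m else None

chunk = b'<html><head><meta charset="windows-1251"></head>'
print(sniff_charset(chunk))  # windows-1251
```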

  • You would never see ? and other random symbols on old web pages (e.g. in place of Microsoft Word's special [read: horrible] quotes).

Except for all those other old web pages that use other non-UTF-8 encodings (the non-English speaking world is pretty big).
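
A small Python sketch of exactly that failure mode: Windows-1252 "smart quotes" are not valid UTF-8, so a strict decoder either rejects them or substitutes U+FFFD (often rendered as ? or a black diamond):

```python
# U+201C/U+201D encode to 0x93/0x94 in Windows-1252, which are bare
# continuation bytes in UTF-8 and therefore invalid.
word_bytes = "\u201cquoted\u201d".encode("windows-1252")  # b'\x93quoted\x94'
print(word_bytes.decode("utf-8", errors="replace"))       # �quoted�
```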

  • More characters can be represented in UTF-8.

True. Your problems of data validation just got harder, too.
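
One concrete way validation gets harder, sketched in Python: a naive "letters only" check now accepts letters from every script, including homoglyphs:

```python
# Cyrillic U+0410 looks identical to Latin "A" but is a different character.
latin_a = "A"
cyrillic_a = "\u0410"
print(cyrillic_a.isalpha())   # True -- passes a naive letters-only check
print(latin_a == cyrillic_a)  # False -- yet they render the same
```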

豆芽 2024-09-23 14:47:04

Why are EBCDIC, Baudot, and Morse still not nuked from orbit? Why did the buggy-whip manufacturers not close their doors the day after Gottlieb Daimler shipped his first automobile?

Relegating a technology to history takes non-zero time.

人心善变 2024-09-23 14:47:04

  • No issues with mixed 1-2-byte character streaming, because everything uses 2 bytes.

Not true at all. UTF-8 is a mixed-width 1, 2, 3, and 4-byte encoding. You may have been thinking of UTF-16, but even that has had 4-byte characters for a while. If you want a “simple” fixed-width encoding, you need UTF-32.
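
A Python comparison of the three encodings (illustrative only):

```python
# UTF-8 and UTF-16 are variable-width; UTF-32 is fixed at 4 bytes.
# The -le variants are used here just to avoid counting a BOM.
for ch in ("A", "é", "中", "😀"):
    print(
        f"U+{ord(ch):05X}:",
        f"utf-8={len(ch.encode('utf-8'))}B",
        f"utf-16={len(ch.encode('utf-16-le'))}B",
        f"utf-32={len(ch.encode('utf-32-le'))}B",
    )
# 😀 (U+1F600) needs 4 bytes even in UTF-16: a surrogate pair.
```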

  • You would never see ? and other random symbols on old web pages

Even with UTF-8 web pages, you still might not have a font that supports every Unicode character, so this is still a problem.

  • More characters can be represented in UTF-8.

Sometimes this is a disadvantage. Having more characters means more bits are required to encode the characters. And to keep track of which ones are letters, digits, etc. And to store the fonts for displaying those characters. And to deal with additional Unicode-related complexities like normalization.

This is probably a non-issue for modern computers with gigabytes of RAM, but don't expect your TI-83 to support Unicode any time soon.
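
The normalization wrinkle is easy to demonstrate with Python's standard unicodedata module:

```python
import unicodedata

# The same visible text can be composed ("é") or decomposed ("e" plus a
# combining accent), so naive comparison fails without normalization.
composed = "\u00e9"
decomposed = "e\u0301"
print(composed == decomposed)                       # False
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True
```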


But still, if you do need those extra characters, it's way easier to work with UTF-8 than with zillions of different 8-bit character encodings (plus a few non-self-synchronizing East Asian multibyte encodings).
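
To sketch what "self-synchronizing" buys you (next_char_start is a made-up helper for illustration):

```python
# In UTF-8, continuation bytes always match 0b10xxxxxx, so from any byte
# offset you can find the next character boundary without restarting.
data = "héllo".encode("utf-8")  # b'h\xc3\xa9llo'

def next_char_start(buf: bytes, pos: int) -> int:
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos

print(next_char_start(data, 2))  # 3 -- lands past the 0xA9 continuation byte
```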

  • So why haven't the inferior encodings been nuked from space?

In large part, this is because the “inferior” programming languages haven't been nuked from space. Lots of code is still written in languages like C and C++ (and even COBOL!) that predate Unicode and still don't have good support for it.

I badly wish we could get rid of the situation where some libraries use char-based strings encoded in UTF-8, while others think char is for legacy encodings and that Unicode should always use wchar_t, and then you have to deal with whether wchar_t is UTF-16 or UTF-32 (or neither).

且行且努力 2024-09-23 14:47:04

I don't think UTF-8 uses "2 bytes"; it's variable length. Also, a lot of OS-level code is UTF-16 or UTF-32, which means that for single-byte Latin encodings the choice is between ASCII and ISO-8859-1.

岁月蹉跎了容颜 2024-09-23 14:47:04

Well, your question is a bit of a why-is-the-world-so-bad complaint. It is the way it is. Pages written in encodings other than UTF-8 date from the times when UTF-8 was badly supported by operating systems and was not yet the de facto standard.

These pages will stay in their original encoding as long as nobody changes them, which in many cases is not very likely. Many of them are no longer maintained by anyone.

There are also a lot of documents on the internet in non-Unicode encodings, in many formats. Someone COULD convert them, but, as above, that requires a lot of effort.

So support for non-Unicode encodings has to stay as well.

And for the current times, take it as a rule that every time someone uses a non-Unicode encoding, a kitten dies.
