Terminology and concepts related to the use of code pages

Posted 2024-09-12 23:30:09

I'm in the process of researching code pages and have come across many conflicting uses of terminology, even amongst different Wikipedia entries. I just can't find a source of information that spells out the entire character handling process from start to finish. Could someone well versed in this field point out where the following is inaccurate or incorrect?

The process of character representation as far as I understand:

  • We start with sets of symbols (not sure of the correct terminology here, possibly 'scripts') that are not associated with any specific platform. 'The Cyrillic alphabet' is understood to refer to the same entity in the context of Windows as in Linux, for example.

  • Members of these sets are selected, generally in bunches, by vendors to form a platform-specific character set. The platform might assign codes to these sets, such as the GDI charset values on Windows (e.g. 0 for ANSI_CHARSET and the other codes mentioned here: http://asa.diac24.net/wiki/index.php?title=ASS:fe&printable=yes). I cannot find much information on these sets, such as whether they are in fact coded character sets or simply unordered and abstract.

  • From these sets, individual code pages are developed that appear to have a one-to-one mapping with GDI values. Since these GDI values appear to represent sets that are platform dependent, does this mean Windows code pages are essentially a coded version of each individual set?

I've been having trouble reconciling this idea with a link shown to me earlier (which I've lost) that showed a one-to-many mapping between these GDI charsets and code pages across different platforms. Is this accurate? Do these GDI values point to sets from which different code pages can be developed on different platforms?

  • Each code page maps a member of an abstract character set onto an integer to represent its position in the set. In the case of the 'simpler' code pages mentioned on the above webpage, these can be referred to using the more precise 'character map' term. Is this term worth considering or is the distinction too subtle and unimportant?

  • A font resolves a code point to a glyph if it contains one for that code point; otherwise it reports a failure. I've also read that a font may return its own blank glyph for code points it doesn't support. Can an application distinguish between this blank glyph and a successful resolution, i.e. does the font return an error code of some sort along with the blank glyph?

I believe that's the extent of my confusion. Any clarification in this regard would be invaluable. Thanks in advance.

帥小哥 2024-09-19 23:30:10

You are essentially correct:

  • Start with the set of all known characters.
  • Select a subset of these characters (a character set).
  • Map these to bit patterns (code page and encoding).
  • Render these to an output device by combining the character with a glyph (i.e. using a font, a bit pattern, and a code page/encoding that maps the bit pattern to a character).
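The first three steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample character and the three Cyrillic code pages (`cp1251`, `koi8_r`, `cp866`) are chosen for the example and do not come from the question, and the final font/glyph step is not something the standard library can show.

```python
# Illustrative sketch: one abstract character, three code pages.
# The codec names are Python's standard aliases; the choice of
# character and code pages is an assumption made for this example.
char = "Ж"  # CYRILLIC CAPITAL LETTER ZHE, Unicode code point U+0416


def code_values(ch, codecs=("cp1251", "koi8_r", "cp866")):
    """Return the single byte that each code page assigns to ch."""
    return {name: ch.encode(name) for name in codecs}


values = code_values(char)
for name, raw in values.items():
    # Same character, different byte value in each code page.
    print(f"{name}: 0x{raw[0]:02X}")

# Decoding recovers the character only with the matching code page.
raw = values["cp1251"]
print(raw.decode("cp1251"))  # the original character
print(raw.decode("koi8_r"))  # a different Cyrillic letter
```

The point of the sketch is that the byte alone carries no meaning; it only becomes a character once a code page is chosen to interpret it.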

Across platforms, there are similar code pages. And even across many code pages there are similar mappings of value to character. For example, Windows Latin, Mac Roman and Unicode share the same characters for the first 128 values (0 to 127, the ASCII range). There is some standardization of code pages (e.g. http://en.wikipedia.org/wiki/Shift_JIS for Japanese) so that machines can interoperate.
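The shared range can be checked directly. The sketch below assumes that Python's `cp1252` and `mac_roman` codecs stand in for "Windows Latin" and "Mac Roman"; the helper name `shared_with_unicode` is hypothetical.

```python
# Sketch: compare how two code pages and Unicode treat single-byte values.
def shared_with_unicode(codec, value):
    """True if the code page maps byte `value` to the Unicode
    code point with the same number."""
    return bytes([value]).decode(codec, errors="replace") == chr(value)


# Values 0..127 are the ASCII range.
ascii_shared = all(
    shared_with_unicode("cp1252", v) and shared_with_unicode("mac_roman", v)
    for v in range(0x80)
)
print(ascii_shared)

# Above the ASCII range the code pages diverge.
print(bytes([0x80]).decode("cp1252"))     # EURO SIGN in Windows-1252
print(bytes([0x80]).decode("mac_roman"))  # A WITH DIAERESIS in Mac Roman
```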

Generally, for new development you should be using Unicode with one of its popular encodings. UTF-8 is popular on most modern systems. UTF-16LE is used by Windows API functions whose names end in W.
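As a small illustration of the difference between the two encodings, the sketch below encodes the same text both ways. The sample strings are assumptions chosen for the example.

```python
# Sketch: the same Unicode text under two encodings.
for text in ("A", "Ж"):
    utf8 = text.encode("utf-8")         # variable width, 1 to 4 bytes per code point
    utf16le = text.encode("utf-16-le")  # 2 or 4 bytes per code point, low byte first
    print(f"U+{ord(text):04X}  utf-8={utf8.hex()}  utf-16-le={utf16le.hex()}")

# Both encodings decode back to the same string.
assert "Ж".encode("utf-8").decode("utf-8") == "Ж".encode("utf-16-le").decode("utf-16-le")
```

Note that ASCII text is byte-for-byte identical in UTF-8, which is one reason it is convenient for files and network protocols, while UTF-16LE always uses at least two bytes per character.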
