What is the range of Unicode printable characters?

Posted 2024-09-25 02:56:36

Can anybody please tell me the range of Unicode printable characters? [E.g. the ASCII printable character range is \u0020 - \u007f]

Comments (9)

罪#恶を代价 2024-10-02 02:56:36

See http://en.wikipedia.org/wiki/Unicode_control_characters

You might especially want to look at the C0 and C1 control characters: http://en.wikipedia.org/wiki/C0_and_C1_control_codes

The wiki says the C0 control characters are in the range U+0000 to U+001F, plus U+007F (the same code points as the ASCII control characters), and the C1 control characters are in the range U+0080 to U+009F.

Besides the C0 and C1 control characters, Unicode also has well over a hundred format control characters, e.g. the zero-width non-joiner, which prevents adjacent characters from being joined into a ligature or cursive connection, and the bidirectional text controls. These format control characters are rather scattered.
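As a rough illustration of how compact the control characters are and how scattered the format characters are, here is a minimal Python sketch. It assumes the Unicode tables bundled with your Python's `unicodedata` module, so the exact number of format ranges depends on the Python version; the helper name `category_ranges` is made up for this example.

```python
import sys
import unicodedata


def category_ranges(category):
    """Return inclusive (start, end) code point ranges with the given General Category."""
    ranges = []
    start = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) == category:
            if start is None:
                start = cp  # a new run begins here
        elif start is not None:
            ranges.append((start, cp - 1))  # the run ended at the previous code point
            start = None
    if start is not None:
        ranges.append((start, sys.maxunicode))
    return ranges


control = category_ranges("Cc")  # C0 and C1 control characters
fmt = category_ranges("Cf")      # format controls such as ZWNJ and the bidi marks

for start, end in control:
    print(f"Cc: U+{start:04X} to U+{end:04X}")
print(f"Cf is split across {len(fmt)} separate ranges")
```

The control category collapses into two runs, while the format category comes out as many short runs spread over the codespace.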

More importantly, what are you doing that requires you to know Unicode's non-printable characters? More likely than not, whatever you're trying to do is the wrong approach to solve your problem.

坏尐絯℡ 2024-10-02 02:56:36

This is an old question, but it is still valid, and I think there is more to say on the subject, usefully but briefly, than the existing answers cover.

Unicode

Unicode defines properties for characters.

One of these properties is "General Category", which has major classes and subclasses. The major classes are Letter, Mark, Number, Punctuation, Symbol, Separator, and Other.

By knowing the properties of your characters, you can decide whether you consider them printable in your particular context.

You must always remember that terms like "character" and "printable" are often difficult and have interesting edge-cases.


Programming Language support

Some programming languages assist with this problem.

For example, the Go language has a "unicode" package which provides many useful Unicode-related functions including these two:

func IsGraphic(r rune) bool

IsGraphic reports whether the rune is defined as a Graphic by Unicode. Such  
characters include letters, marks, numbers, punctuation, symbols, and spaces, 
from categories L, M, N, P, S, Zs. 

func IsPrint(r rune) bool

IsPrint reports whether the rune is defined as printable by Go. Such  
characters include letters, marks, numbers, punctuation, symbols, and  
the ASCII space character, from categories L, M, N, P, S and the ASCII  
space character. This categorization is the same as IsGraphic except  
that the only spacing character is ASCII space, U+0020.

Notice that it says "defined as printable by Go", not "defined as printable by Unicode". It is almost as if there are some depths the wizards at Unicode dare not plumb.
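For readers not using Go, the two definitions quoted above can be approximated from the General Category alone. The following Python sketch is only an approximation under that assumption: the function names `is_graphic` and `is_print` are made up here, and the results follow the Unicode version shipped with your Python rather than Go's tables.

```python
import unicodedata

# Major classes that both Go definitions accept: Letter, Mark, Number,
# Punctuation, Symbol.
GRAPHIC_MAJOR_CLASSES = {"L", "M", "N", "P", "S"}


def is_graphic(ch):
    """Approximate Go's unicode.IsGraphic: L, M, N, P, S, plus space separators (Zs)."""
    cat = unicodedata.category(ch)
    return cat[0] in GRAPHIC_MAJOR_CLASSES or cat == "Zs"


def is_print(ch):
    """Approximate Go's unicode.IsPrint: like is_graphic, but the only space is U+0020."""
    if ch == " ":
        return True
    return unicodedata.category(ch)[0] in GRAPHIC_MAJOR_CLASSES


for ch in ["A", " ", "\u00a0", "\u200b", "\x7f"]:
    print(f"U+{ord(ch):04X}: graphic={is_graphic(ch)}, print={is_print(ch)}")
```

The no-break space (U+00A0) is where the two differ: it counts as graphic but not as printable. Python's built-in `str.isprintable()` draws a similar line to `is_print`, treating the "Other" and "Separator" categories as non-printable except for the ASCII space.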


Printable

The more you learn about Unicode, the more you realise how unexpectedly diverse and unfathomably weird human writing systems are.

In particular, whether a given "character" is printable is not always obvious.

Is a zero-width space printable? When is a hyphenation point printable? Are there characters whose printability depends on their position in a word or on what characters are adjacent to them? Is a combining-character always printable?
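One way to start answering such questions for your own context is to look up the properties of the characters involved. The sketch below uses Python's `unicodedata` module purely as an illustration; the sample characters are my own choice, not a definitive list of edge cases.

```python
import unicodedata

samples = {
    "ZERO WIDTH SPACE": "\u200b",
    "SOFT HYPHEN": "\u00ad",
    "HYPHENATION POINT": "\u2027",
    "COMBINING ACUTE ACCENT": "\u0301",
}

for label, ch in samples.items():
    print(
        f"U+{ord(ch):04X} {label}: "
        f"category={unicodedata.category(ch)}, "
        f"combining class={unicodedata.combining(ch)}"
    )

# A combining mark has no visible form of its own; it modifies the preceding
# character, and normalisation may fuse the two into a single code point.
composed = unicodedata.normalize("NFC", "e\u0301")
print(len("e\u0301"), len(composed))
```

The zero-width space and the soft hyphen are both format characters (Cf), the hyphenation point is ordinary punctuation (Po), and the combining accent is a non-spacing mark (Mn), so a category-based rule gives a different answer for each of them.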


Footnotes

ASCII printable character range is \u0020 - \u007f

No, it isn't. \u007f is DEL, which is not normally considered a printable character. It is, for example, associated with the keyboard key labelled "DEL", whose earliest purpose was to command the deletion of a character from some medium (display, file etc).

In fact, many 8-bit character sets have several non-contiguous ranges of non-printable characters. See for example the C0 and C1 controls.

鸠魁 2024-10-02 02:56:36

First, you should remove the word 'UTF8' from your question; it's not pertinent (UTF-8 is just one of the encodings of Unicode, and is orthogonal to your question).
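To make the distinction concrete, here is a small Python illustration (the sample character is arbitrary): the code point identifies the character, while the bytes depend on which encoding is chosen.

```python
ch = "\u00e9"  # é, LATIN SMALL LETTER E WITH ACUTE

code_point = ord(ch)            # the Unicode code point, independent of encoding
utf8 = ch.encode("utf-8")       # two bytes in UTF-8
utf16 = ch.encode("utf-16-be")  # two different bytes in big-endian UTF-16

print(f"U+{code_point:04X}", utf8.hex(), utf16.hex())
```

Whether a character is printable is a property of the code point, so the answer is the same whichever encoding carries it.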

Second: the meaning of "printable/non-printable" is less clear in Unicode. Perhaps you mean a "graphical character"; one can even dispute whether a space is printable/graphical. The non-graphical characters would consist, basically, of the control characters: the range 0x00-0x1f, plus some others that are scattered.

Anyway, the vast majority of Unicode characters (more than 100,000) are "graphical". But this certainly does not imply that they are printable in your environment.

It seems to me a bad idea, if you intend to generate a "random printable" unicode string, to try to include all "printable" characters.

夏雨凉 2024-10-02 02:56:36

What you should do is pick a font, and then generate a list of which Unicode characters have glyphs defined for your font. You can use a font library like freetype to test glyphs (test for FT_Get_Char_Index(...) != 0).
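The shape of that test can be sketched without committing to a particular font library. In the Python sketch below, `glyph_index` is a hypothetical callable standing in for a real lookup such as FreeType's FT_Get_Char_Index bound to a loaded face, and `fake_cmap` is made-up data used only so the example runs on its own.

```python
def code_points_with_glyphs(glyph_index, candidates):
    """Return the candidate code points for which the font reports a glyph.

    glyph_index is a hypothetical callable mapping a code point to a glyph
    index, where 0 means "no glyph" (the convention FT_Get_Char_Index uses).
    """
    return [cp for cp in candidates if glyph_index(cp) != 0]


# Stand-in for a real font: an invented character map (code point -> glyph index).
fake_cmap = {0x41: 36, 0x42: 37, 0xE9: 120}

covered = code_points_with_glyphs(lambda cp: fake_cmap.get(cp, 0), range(0x100))
print([f"U+{cp:04X}" for cp in covered])
```

With a real font you would pass a function that queries the font instead of the dictionary lookup, and scan whatever part of the codespace you care about.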

与酒说心事 2024-10-02 02:56:36

What characters are valid?

At present, Unicode is defined as starting from U+0000 and ending at U+10FFFF. The first block, Basic Latin, spans U+0000 to U+007F and the last block, Supplementary Private Use Area-B, spans U+100000 to U+10FFFF. If you want to see all of these blocks, see here: Wikipedia.org: Unicode Block; List of Blocks.

Let's break down what's valid/invalid in the Latin block.

The Latin Block: TLDR

If you're interested in filtering out invisible control characters, you'll want to filter out:

  • U+0000 to U+0008: Control
  • U+000E to U+001F: Device (i.e., Control)
  • U+007F: Delete (Control)
  • U+0080 to U+009F: C1 controls (i.e., Control)

The Latin Block: Full Ranges

Here's the Latin block, broken up into smaller sections...

  • U+0000 to U+0008: Control
  • U+0009 to U+000D: Whitespace (tab, line feed, vertical tab, form feed, carriage return)
  • U+000E to U+001F: Device (i.e., Control)
  • U+0020: Space
  • U+0021 to U+002F: Symbols
  • U+0030 to U+0039: Numbers
  • U+003A to U+0040: Symbols
  • U+0041 to U+005A: Uppercase Letters
  • U+005B to U+0060: Symbols
  • U+0061 to U+007A: Lowercase Letters
  • U+007B to U+007E: Symbols
  • U+007F: Delete (Control)
  • U+0080 to U+009F: C1 controls (i.e., Control)
  • U+00A0: Non-breaking space (i.e., &nbsp; in HTML)
  • U+00A1 to U+00BF: Symbols (U+00AD, the soft hyphen, is normally invisible)
  • U+00C0 to U+00FF: Accented letters, plus × (U+00D7) and ÷ (U+00F7)
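A filter built on ranges like these might look like the following Python sketch. It covers only Basic Latin and Latin-1 Supplement, keeps the whitespace controls U+0009 to U+000D, and treats the whole of U+0080 to U+009F as C1 controls; the names `CONTROL_RANGES`, `is_latin_control` and `strip_latin_controls` are made up for this example.

```python
# Control ranges in Basic Latin and Latin-1 Supplement (inclusive bounds).
CONTROL_RANGES = [
    (0x0000, 0x0008),  # C0 controls before the tab
    (0x000E, 0x001F),  # remaining C0 controls
    (0x007F, 0x007F),  # DEL
    (0x0080, 0x009F),  # C1 controls
]


def is_latin_control(ch):
    """True if ch falls in one of the control ranges listed above."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in CONTROL_RANGES)


def strip_latin_controls(text):
    """Drop the listed controls; tabs, newlines and everything else are kept."""
    return "".join(ch for ch in text if not is_latin_control(ch))


print(repr(strip_latin_controls("a\x00b\x7fc\x85d\te\n")))
```

Characters outside these two blocks pass through untouched, so this is a Latin-only filter rather than a general test for printability.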

The Other Blocks

Unicode is famous for supporting non-Latin character sets, so what are these other blocks? This is just a broad overview, see the wikipedia.org page for the full, complete list.

Latin1 & Latin1-Related Blocks

  • U+0000 to U+007F: Basic Latin
  • U+0080 to U+00FF: Latin-1 Supplement
  • U+0100 to U+017F: Latin Extended-A
  • U+0180 to U+024F: Latin Extended-B

Combinable blocks

U+0250 to U+036F: 3 Blocks.

Non-Latin, Language blocks

U+0370 to U+1C7F: 55 Blocks.

Non-Latin, Language Supplement blocks

U+1C80 to U+209F: 11 Blocks.

Symbol blocks

U+20A0 to U+2BFF: 22 Blocks.

Ancient Language blocks

U+2C00 to U+2C5F: 1 Block (Glagolitic).

Language Extensions blocks

U+2C60 to U+FFEF: 66 Blocks.

Special blocks

U+FFF0 to U+FFFF: 1 Block (Specials).

以歌曲疗慰 2024-10-02 02:56:36

Taking the opposite approach to @HoldOffHunger, it might be easier to list the ranges of non-printable characters and negate the match to test whether a character is printable.

In the style of a regex character class (if you want printable characters instead, put a ^ after the opening bracket):

[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF]

This accounts for things like separator spaces and joiners.

Note that, unlike their answer, which is a whitelist that ignores all non-Latin languages, this blacklist won't permit non-printable characters just because they're in blocks with printable characters (their answer wholly includes the Non-Latin, Language Supplement blocks as 'printable', even though they contain things like the 'zero-width non-joiner').

Be aware, though, that if you use this or any other solution for sanitization, for example, you may want to do something more nuanced than a blanket replace.
Arguably, in that case, non-breaking spaces should be changed to spaces rather than removed, and invisible separators should conditionally be replaced with commas.
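As a sketch of that more nuanced handling, the Python below first turns space-like characters into a plain space and only then removes whatever else the character class matches. The choice of which characters count as "space-like" is my own assumption, as are the names `SPACE_LIKE`, `NON_PRINTABLE`, `sanitize` and `is_printable`; the conditional comma replacement is left out.

```python
import re

# Space-like characters that arguably should become a plain space
# rather than disappear (an assumption made for this example).
SPACE_LIKE = re.compile(r"[\u00A0\u2000-\u200A\u202F\u205F\u3000]")

# The non-printable character class from this answer.
NON_PRINTABLE = re.compile(
    r"[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F"
    r"\u2028-\u202F\u205F-\u206F\u3000\uFEFF]"
)


def sanitize(text):
    """Replace space-like characters with ' ', then drop the remaining matches."""
    text = SPACE_LIKE.sub(" ", text)
    return NON_PRINTABLE.sub("", text)


def is_printable(ch):
    """Negated test: a single character is printable if the class does not match it."""
    return NON_PRINTABLE.fullmatch(ch) is None


print(repr(sanitize("a\u00a0b\u200bc\x07d")))
```

Python's `re` module works on code points for `str` patterns, so no separate Unicode flag is needed here.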

Then there are the invalid character ranges, either [as yet] unused or reserved for encoding purposes, and the language-specific variation selectors.


NB: when using regular expressions, enable Unicode awareness if it isn't on by default (in JavaScript it's via /.../u).

You can tell whether you have it right by attempting to create the regular expression with some character ranges outside the Basic Multilingual Plane, which need a surrogate pair in UTF-16.
For example, the above, plus the variation selector range \u{E0100}-\u{E01EF}, in JavaScript:

/[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF\u{E0100}-\u{E01EF}]/u

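Without u, a range written with the literal characters U+E0100 to U+E01EF is read as \uDB40(\uDD00-\uDB40)\uDDEF, not (\uDB40\uDD00)-(\uDB40\uDDEF) (the \u{...} escape itself is only recognised with u). If you are replacing, you should always enable u, even when the regex itself contains no characters outside the Basic Multilingual Plane, as you might otherwise break surrogate pairs that exist in the text.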
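The surrogate values quoted above follow from the standard UTF-16 arithmetic, which can be checked with a short Python sketch (the helper name `to_surrogates` is made up for this example):

```python
def to_surrogates(cp):
    """Split a supplementary code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits of the offset
    return high, low


for cp in (0xE0100, 0xE01EF):
    high, low = to_surrogates(cp)
    print(f"U+{cp:05X} -> \\u{high:04X} \\u{low:04X}")
```

Both ends of the range share the high surrogate \uDB40, which is why a regex engine working on UTF-16 code units sees the range as running between two low surrogates with a stray high surrogate on either side.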

孤星 2024-10-02 02:56:36

One approach is to render each character to a texture and manually check if it is visible. This solution excludes spaces.

I've written such a program and used it to determine that there are roughly 467241 printable characters within the first 471859 code points. I've selected this number because it covers all of the first 4 Planes of Unicode, which seem to contain all printable characters. See https://en.wikipedia.org/wiki/Plane_(Unicode)

I would very much like to refine my program to produce the list of ranges, but for now here's what I am working with, for anyone who needs immediate answers:

https://editor.p5js.org/SamyBencherif/sketches/_OE8Y3kS9

I am posting this tool because I think this question attracts a lot of people who are looking for slightly different applications of knowing printable ranges. Hopefully this is useful, even though it does not fully answer the question.

拥抱我好吗 2024-10-02 02:56:36

The printable character range within ASCII (the first 128 Unicode code points), written as decimal integers rather than hex, is 32 to 126.
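That decimal range is easy to check in Python; this is only a quick illustration for the ASCII part of the codespace:

```python
import string

# Code points 32 to 126 inclusive, i.e. U+0020 to U+007E.
printable_ascii = "".join(chr(i) for i in range(32, 127))

print(len(printable_ascii))
print(printable_ascii)
```

The characters on either side of the range, 31 (unit separator) and 127 (DEL), are controls.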

香草可樂 2024-10-02 02:56:36

Unicode, strictly speaking, has no fixed range of assigned characters: the codespace ends at U+10FFFF, but assignments within it keep growing.

What you gave is not UTF-8, which uses 1 byte for ASCII characters.

As for the range, I believe there is no range of printable characters. It always evolves. Check the page I gave above.
