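What is the range of Unicode printable characters?
Can anybody please tell me what the range of Unicode printable characters is? [e.g. the ASCII printable character range is \u0020 - \u007f]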
See http://en.wikipedia.org/wiki/Unicode_control_characters
You might want to look especially at the C0 and C1 control characters: http://en.wikipedia.org/wiki/C0_and_C1_control_codes
According to the wiki, the C0 control characters are in the range U+0000 to U+001F, plus U+007F (the same as the ASCII control characters), and the C1 control characters are in the range U+0080 to U+009F.
Other than the C0 and C1 control characters, Unicode also has well over a hundred formatting control characters, e.g. the zero-width non-joiner, which prevents adjacent characters from being joined or ligated, or the bidirectional text controls. These formatting control characters are rather scattered.
More importantly, what are you doing that requires you to know Unicode's non-printable characters? More likely than not, whatever you're trying to do is the wrong approach to solving your problem.
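As a rough illustration of the ranges quoted in this answer, here is a minimal JavaScript sketch. The function names isC0OrC1Control and describe are made up for this example; the \p{Cc} and \p{Cf} property escapes are standard in engines that support Unicode property escapes with the u flag. It shows that the C0/C1 ranges coincide with General Category Cc, while format characters such as the zero-width non-joiner fall into Cf and are scattered elsewhere.

```javascript
// C0 and C1 control ranges as quoted in the answer (hypothetical helper name).
function isC0OrC1Control(codePoint) {
  return (
    (codePoint >= 0x0000 && codePoint <= 0x001f) ||
    codePoint === 0x007f ||
    (codePoint >= 0x0080 && codePoint <= 0x009f)
  );
}

// General Category "Cc" (control) and "Cf" (format), via Unicode property escapes.
const controlCategory = /^\p{Cc}$/u;
const formatCategory = /^\p{Cf}$/u;

// Hypothetical helper: classify a code point as control, format, or other.
function describe(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  if (controlCategory.test(ch)) return "control";
  if (formatCategory.test(ch)) return "format";
  return "other";
}

console.log(describe(0x0009)); // TAB
console.log(describe(0x200c)); // ZERO WIDTH NON-JOINER
console.log(describe(0x202e)); // RIGHT-TO-LEFT OVERRIDE
console.log(describe(0x0041)); // LATIN CAPITAL LETTER A
```

The range check is cheap and needs no Unicode tables, but it only covers the control characters; anything that should also catch format characters needs the category-based test.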
This is an old question, but it is still valid, and I think there is more that can usefully, if briefly, be said on the subject than the existing answers cover.
Unicode
Unicode defines properties for characters.
One of these properties is "General Category", which has Major classes and subclasses. The Major classes are Letter, Mark, Number, Punctuation, Symbol, Separator, and Other.
By knowing the properties of your characters, you can decide whether you consider them printable in your particular context.
You must always remember that terms like "character" and "printable" are often difficult and have interesting edge-cases.
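To make the General Category idea concrete, here is a small JavaScript sketch that maps a code point to its major class using Unicode property escapes. The names MAJOR_CLASSES and majorClass are invented for this example, and the result depends on the Unicode version your JavaScript engine ships with. Number is included as one of the seven major classes.

```javascript
// One pattern per major class of the General Category property.
const MAJOR_CLASSES = [
  ["Letter", /^\p{L}$/u],
  ["Mark", /^\p{M}$/u],
  ["Number", /^\p{N}$/u],
  ["Punctuation", /^\p{P}$/u],
  ["Symbol", /^\p{S}$/u],
  ["Separator", /^\p{Z}$/u],
  ["Other", /^\p{C}$/u], // controls, format characters, surrogates, private use, unassigned
];

// Hypothetical helper: return the major class name for a code point.
function majorClass(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  for (const [name, pattern] of MAJOR_CLASSES) {
    if (pattern.test(ch)) return name;
  }
  return "Unknown";
}

console.log(majorClass(0x0041)); // LATIN CAPITAL LETTER A
console.log(majorClass(0x0301)); // COMBINING ACUTE ACCENT
console.log(majorClass(0x200b)); // ZERO WIDTH SPACE
console.log(majorClass(0x2027)); // HYPHENATION POINT
```

Whether "Separator" or "Mark" counts as printable is then a decision for your own context, which is the point the answer is making.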
Programming Language support
Some programming languages assist with this problem.
For example, the Go language has a "unicode" package which provides many useful Unicode-related functions, including these two:
func IsGraphic(r rune) bool
func IsPrint(r rune) bool
Notice that the documentation for IsPrint says "defined as printable by Go", not "defined as printable by Unicode". It is almost as if there are some depths the wizards at Unicode dare not plumb.
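For readers not using Go, the distinction can be approximated in JavaScript. This is only a sketch: Go's unicode.IsGraphic is documented as covering categories L, M, N, P, S and Zs, and unicode.IsPrint as the same set except that the only space allowed is ASCII space (U+0020). The names isGraphicLike and isPrintLike are made up here, and the results may differ from Go's wherever the two runtimes ship different Unicode versions.

```javascript
// Letters, marks, numbers, punctuation, symbols, plus all space separators (Zs).
const GRAPHIC = /^[\p{L}\p{M}\p{N}\p{P}\p{S}\p{Zs}]$/u;
// Same set without Zs; ASCII space is added back explicitly.
const PRINT = /^[\p{L}\p{M}\p{N}\p{P}\p{S}\u0020]$/u;

// Hypothetical approximations of Go's IsGraphic and IsPrint for a single character.
function isGraphicLike(ch) {
  return GRAPHIC.test(ch);
}

function isPrintLike(ch) {
  return PRINT.test(ch);
}

for (const ch of ["A", " ", "\u00A0", "\u3000", "\u007F", "\u200B"]) {
  const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  console.log(`U+${hex} graphic=${isGraphicLike(ch)} print=${isPrintLike(ch)}`);
}
```

The two predicates only disagree on non-ASCII spaces such as NO-BREAK SPACE and IDEOGRAPHIC SPACE, which is exactly the kind of edge case where "printable" becomes a policy choice.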
Printable
The more you learn about Unicode, the more you realise how unexpectedly diverse and unfathomably weird human writing systems are.
In particular whether a particular "character" is printable is not always obvious.
Is a zero-width space printable? When is a hyphenation point printable? Are there characters whose printability depends on their position in a word or on what characters are adjacent to them? Is a combining-character always printable?
Footnotes
No, it isn't. \u007f is DEL, which is not normally considered a printable character. It is, for example, associated with the keyboard key labelled "DEL", whose earliest purpose was to command the deletion of a character from some medium (display, file etc).
In fact many 8-bit character sets have many non-consecutive ranges which are non-printable. See for example C0 and C1 controls.
First, you should remove the word 'UTF8' from your question; it's not pertinent (UTF-8 is just one of the encodings of Unicode, and is orthogonal to your question).
Second: the meaning of "printable/non-printable" is less clear in Unicode. Perhaps you mean a "graphical character", and one can even dispute whether a space is printable/graphical. The non-graphical characters would consist, basically, of control characters: the range 0x00-0x1F plus some others that are scattered.
Anyway, the vast majority of Unicode characters (well over 100,000) are "graphical". But this certainly does not imply that they are printable in your environment.
It seems to me a bad idea, if you intend to generate a "random printable" Unicode string, to try to include all "printable" characters.
What you should do is pick a font, and then generate a list of the Unicode characters that have glyphs defined in that font. You can use a font library like FreeType to test for glyphs (test for FT_Get_Char_Index(...) != 0).
What characters are valid?
At present, the Unicode code space starts at U+0000 and ends at U+10FFFF. The first block, Basic Latin, spans U+0000 to U+007F, and the last block, Supplementary Private Use Area-B, spans U+100000 to U+10FFFF. If you want to see all of these blocks, see here: Wikipedia.org: Unicode Block; List of Blocks.
Let's break down what's valid/invalid in the Latin block.
The Latin Block: TLDR
If you're interested in filtering out invisible characters, you'll want to filter out:
U+0000 to U+0008: Control
U+000E to U+001F: Device (i.e., Control)
U+007F: Delete (Control)
U+0080 to U+009F: Device (i.e., Control)
The Latin Block: Full Ranges
Here's the Latin block, broken up into smaller sections...
U+0000 to U+0008: Control
U+0009 to U+000D: Whitespace (tab, line feed, vertical tab, form feed, carriage return)
U+000E to U+001F: Device (i.e., Control)
U+0020: Space
U+0021 to U+002F: Symbols
U+0030 to U+0039: Numbers
U+003A to U+0040: Symbols
U+0041 to U+005A: Uppercase Letters
U+005B to U+0060: Symbols
U+0061 to U+007A: Lowercase Letters
U+007B to U+007E: Symbols
U+007F: Delete (Control)
U+0080 to U+009F: Device (i.e., Control)
U+00A0: Non-breaking space
U+00A1 to U+00BF: Symbols (except U+00AD, the soft hyphen, which is an invisible format character)
U+00C0 to U+00FF: Accented characters (plus the symbols U+00D7 and U+00F7)
The Other Blocks
Unicode is famous for supporting non-Latin character sets, so what are these other blocks? This is just a broad overview; see the wikipedia.org page for the complete list.
Latin1 & Latin1-Related Blocks
U+0000 to U+007F: Basic Latin
U+0080 to U+00FF: Latin-1 Supplement
U+0100 to U+017F: Latin Extended-A
U+0180 to U+024F: Latin Extended-B
Combinable blocks
U+0250 to U+036F: 3 Blocks.
Non-Latin, Language blocks
U+0370 to U+1C7F: 55 Blocks.
Non-Latin, Language Supplement blocks
U+1C80 to U+209F: 11 Blocks.
Symbol blocks
U+20A0 to U+2BFF: 22 Blocks.
Ancient Language blocks
U+2C00 to U+2C5F: 1 Block (Glagolitic).
Language Extensions blocks
U+2C60 to U+FFEF: 66 Blocks.
Special blocks
U+FFF0 to U+FFFF: 1 Block (Specials).
Taking the opposite approach to @HoldOffHunger, it might be easier to list the ranges of non-printable characters, and use not to test if a character is printable.
In the style of regex (so if you want printable characters, place a ^ at the start of the class):
[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF]
This accounts for things like separator spaces and joiners.
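A minimal JavaScript sketch of this blacklist approach follows. The character class is the one from the JavaScript pattern quoted later in this answer, without the \u{E0100}-\u{E01EF} range; the names NON_PRINTABLE and isPrintable are invented for the example. Note that tab (U+0009) and line feed (U+000A) are not on the list, so they pass.

```javascript
// Blacklist of non-printable ranges, taken from the answer's character class.
const NON_PRINTABLE =
  /[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF]/u;

// Hypothetical helper: a string is "printable" if nothing in it is on the blacklist.
function isPrintable(text) {
  return !NON_PRINTABLE.test(text);
}

console.log(isPrintable("hello world"));      // plain ASCII text
console.log(isPrintable("zero\u200Bwidth"));  // contains ZERO WIDTH SPACE
console.log(isPrintable("bell\u0007"));       // contains a C0 control
console.log(isPrintable("\u03A9 and \u4E2D")); // Greek and CJK letters
```

Because the test is "nothing blacklisted is present", characters from every script pass without having to be listed, which is the advantage the answer claims over a whitelist.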
Note that, unlike their answer, which is a whitelist that ignores all non-Latin languages, this blacklist won't permit non-printable characters just because they're in blocks with printable characters (their answer wholly includes the Non-Latin, Language Supplement blocks as 'printable', even though they contain things like the zero-width non-joiner).
Be aware, though, that if you use this or any other solution, for sanitization for example, you may want to do something more nuanced than a blanket replace.
Arguably in that case, non-breaking spaces should change to a space rather than be removed, and an invisible separator should conditionally be replaced with a comma.
Then there are invalid character ranges, either not [yet] used or reserved for encoding purposes, and language-specific variation selectors.
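Here is one way such a more nuanced sanitizer might look in JavaScript. Everything specific in it is an assumption made for illustration: the function name sanitize, the choice of U+00A0 and U+202F as the no-break spaces to convert, and the condition under which INVISIBLE SEPARATOR (U+2063) becomes a comma (only between two letters or digits). The answer does not say what condition it has in mind.

```javascript
// Blacklist from the answer, with the g flag so replace() handles every match.
const NON_PRINTABLE_ALL =
  /[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF]/gu;

// Hypothetical sanitizer: convert where a replacement makes sense, drop the rest.
function sanitize(text) {
  return text
    // No-break spaces become an ordinary space instead of vanishing.
    .replace(/[\u00A0\u202F]/gu, " ")
    // Assumed condition: U+2063 becomes a comma only between letters or digits.
    .replace(/([\p{L}\p{N}])\u2063(?=[\p{L}\p{N}])/gu, "$1,")
    // Whatever is still on the blacklist is removed.
    .replace(NON_PRINTABLE_ALL, "");
}

console.log(JSON.stringify(sanitize("10\u00A0km")));
console.log(JSON.stringify(sanitize("1\u20632\u20633")));
console.log(JSON.stringify(sanitize("zero\u200Bwidth")));
```

The order matters: the conversions run before the blanket removal, because U+202F and U+2063 are themselves inside the blacklisted ranges.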
NB: when using regular expressions, enable Unicode awareness if it isn't on by default (for JavaScript it's via /.../u).
You can tell whether you have it correct by attempting to create the regular expression with some ranges of characters outside the Basic Multilingual Plane.
For example, the above, plus the variation selector range \u{E0100}-\u{E01EF}, in JavaScript:
/[\u0000-\u0008\u000B-\u001F\u007F-\u009F\u2000-\u200F\u2028-\u202F\u205F-\u206F\u3000\uFEFF\u{E0100}-\u{E01EF}]/u
Without u, the range \u{E0100}-\u{E01EF} is not read as two code points: in terms of UTF-16 surrogates it equates to \uDB40(\uDD00-\uDB40)\uDDEF, not (\uDB40\uDD00)-(\uDB40\uDDEF). If replacing, you should always enable u, even when the regex itself contains no characters outside the Basic Multilingual Plane, as you might otherwise break surrogate pairs that exist in the text.
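The surrogate-pair hazard is easy to reproduce. In this JavaScript sketch (the variable names are arbitrary), the same character class is applied with and without the u flag to a string containing U+1F600, which UTF-16 stores as the pair \uD83D \uDE00.

```javascript
// One code point outside the BMP, stored as two UTF-16 code units.
const text = "a\u{1F600}b";

// Without u the regex works on code units, so it can match half of a pair.
const withoutU = text.replace(/[\uD800-\uDBFF]/g, "");

// With u the regex works on code points; the pair is one unit and is left alone.
const withU = text.replace(/[\uD800-\uDBFF]/gu, "");

console.log(withoutU === "a\uDE00b"); // true: a lone low surrogate is left behind
console.log(withU === text);          // true: the text is unchanged
console.log(text.replace(/./g, "_").length);  // 4 (code units)
console.log(text.replace(/./gu, "_").length); // 3 (code points)
```

The first replacement silently corrupts the string, which is why the flag matters even when the pattern itself only mentions BMP characters.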
One approach is to render each character to a texture and manually check if it is visible. This solution excludes spaces.
I've written such a program and used it to determine that there are roughly 467241 printable characters within the first 471859 code points. I chose this number because it covers all of the first 4 planes of Unicode, which seem to contain all printable characters. See https://en.wikipedia.org/wiki/Plane_(Unicode)
I would very much like to refine my program to produce the list of ranges, but for now here is what I am working with, for anyone who needs immediate answers:
https://editor.p5js.org/SamyBencherif/sketches/_OE8Y3kS9
I am posting this tool because I think this question attracts a lot of people who are looking for slightly different applications of knowing printable ranges. Hopefully this is useful, even though it does not fully answer the question.
Expressed as decimal integers rather than hex, the printable range within ASCII (the first 128 Unicode code points) is 32 to 126.
Unicode, strictly speaking, has no single printable range. The code space itself is bounded (U+0000 to U+10FFFF), but the set of assigned characters keeps growing.
What you gave is not UTF-8, which uses 1 byte for ASCII characters.
As for the range, I believe there is no range of printable characters. It always evolves. Check the page I gave above.