How to generate all possible Unicode characters?

Posted on 2025-01-16 08:16:18

If we type letters in R, we get all lowercase letters of the English alphabet. However, there are many more possible characters, such as ä and é, as well as symbols like $ or (. I found this table of Unicode characters, which is exactly what I need. Of course, I do not want to copy and paste hundreds of Unicode characters into a vector by hand.

What I've tried so far: the table gives the decimal code points for (some of) the Unicode characters. For example, see the following small table:

Glyph    Decimal    Unicode    Usage in R
!        33         U+0021     "\U0021"

So if we type "\U0021" we get !. Furthermore, paste0("U", format(as.hexmode(33), width= 4, flag="0")) returns "U0021", which is quite close to what I need, but adding \ results in an error:

paste0("\U", format(as.hexmode(33), width= 4, flag="0"))
Error: '\U' used without hex digits in character string starting ""\U"

I am stuck. And I am afraid that even if I figure out how to transform numbers to characters using as.hexmode(), there is still the problem that the table does not give decimals for all Unicode characters (see the table; the decimals end at 591).

Any idea how to generate a vector with all the Unicode characters listed in the linked table?

(The question started with a real-world problem, but now I am mostly just eager to know how to do this.)

Comments (1)

月光色 2025-01-23 08:16:18

There may be easier ways to do this, but here goes. The Unicode package contains everything you need.

First we can get a list of Unicode scripts and their code point ranges:

library(Unicode)  

uranges <- u_scripts()

Check what we've got:

head(uranges, 3)

$Adlam
[1] U+1E900..U+1E943 U+1E944..U+1E94A U+1E94B          U+1E950..U+1E959 U+1E95E..U+1E95F

$Ahom
 [1] U+11700..U+1171A U+1171D..U+1171F U+11720..U+11721 U+11722..U+11725 U+11726          U+11727..U+1172B U+11730..U+11739 U+1173A..U+1173B U+1173C..U+1173E U+1173F         
[11] U+11740..U+11746

$Anatolian_Hieroglyphs
[1] U+14400..U+14646

Next we can expand the ranges into sequences of code points.

expand_uranges <- lapply(uranges, as.u_char_seq)
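
To illustrate what expanding a range means, here is a minimal base-R sketch that does not use the Unicode package. The helper expand_range() is a hypothetical name introduced only for this example, and it assumes ranges written in the "U+XXXX..U+YYYY" form shown in the output above. It is not how as.u_char_seq is implemented.

```r
# Hypothetical helper (not part of the Unicode package): expand a textual
# range such as "U+30A1..U+30FA", or a single code point such as "U+1E94B",
# into an integer vector of code points.
expand_range <- function(x) {
  # Split on the ".." separator; a single code point yields one element.
  ends <- strsplit(x, "..", fixed = TRUE)[[1]]
  # Drop the "U+" prefix and parse the remaining hex digits.
  cps <- strtoi(sub("^U\\+", "", ends), base = 16L)
  seq(cps[1], cps[length(cps)])
}

katakana <- expand_range("U+30A1..U+30FA")
length(katakana)            # number of code points in the range
intToUtf8(katakana[1:5])    # first five characters as one string
expand_range("U+1E94B")     # a single code point stays a single value
```

The Unicode package does the same kind of expansion for every range of every script at once, which is why the result is a list with one entry per script.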

To get a single vector of all characters we can unlist it. Such a flat vector is not easy to work with, though, so it is usually better to keep them as a list:

all_unicode_chars <- unlist(expand_uranges)

# The Wikipedia page linked states there are 144,697 characters 
length(all_unicode_chars)
[1] 144762

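So that seems to be all of them. The count is 65 higher than the page's figure, which probably reflects the 65 control characters (U+0000..U+001F and U+007F..U+009F) being included here rather than the page being out of date. The code points are stored as integers, so to print them (assuming the glyph is supported) we can do, for example, for Japanese katakana: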

intToUtf8(expand_uranges$Katakana[[1]])

[1] "ァアィイゥウェエォオカガキギクグケゲコゴサザシジスズセゼソゾタダチヂッツヅテデトドナニヌネノハバパヒビピフブプヘベペホボポマミムメモャヤュユョヨラリルレロヮワヰヱヲンヴヵヶヷヸヹヺ"
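
On the question's paste0 attempt: the error occurs because R interprets "\U" escapes when it parses the string literal, before paste0 ever runs, so the escape cannot be assembled from pieces at run time. A base-R sketch that sidesteps this converts the decimal code points directly with intToUtf8. The range 33:591 is taken from the question's table, and the variable names are illustrative only.

```r
# Decimal code points from the question's table (33 = "!" up to 591).
code_points <- 33:591

# multiple = TRUE returns one string per code point instead of one long string.
chars <- intToUtf8(code_points, multiple = TRUE)

# Rebuild the question's small table: glyph, decimal, and U+XXXX notation.
tab <- data.frame(
  Glyph   = chars,
  Decimal = code_points,
  Unicode = sprintf("U+%04X", code_points),
  stringsAsFactors = FALSE
)

head(tab, 3)
```

Note that this range also covers the control characters 127 to 159, which have no visible glyph, so you may want to filter those out before using the vector.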