How to enter non-BMP Unicode characters (code points with more than 4 hex digits) as Mathematica input

Posted 2024-12-14 12:35:00

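Problem description: Mathematica uses "\:nnnn" as the syntax for Unicode input. For example, entering "\:6c34" gives "水" (the Chinese character for water). But what if one wants to enter "\:1f618" (face throwing a kiss)? When I tried this, I got "ὡ8" instead of the face throwing a kiss. So Mathematica evaluates "\:1f61" before I get to type the "8".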

Question: How can we delay this evaluation, or how can we enter an arbitrary Unicode character (one whose code point has more than 4 hex digits)?

Platform: I am running Mathematica 8 on an Intel Mac. I tried both the command-line version of Mathematica and the Mathematica notebook; they behave the same.

Thanks.

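Reflection: Unicode is an extensible standard and it can grow (and it does grow :)). A software system that implements the standard may implement only a subset of it in order to be efficient and useful (8-bit, 16-bit, or 32-bit encodings). As a user of a software package, one should not assume that because the software says it supports Unicode, it supports all of Unicode.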

冬天旳寂寞 2024-12-21 12:35:00

Short answer: You can't do this because Mathematica doesn't support these characters properly. See the end of the post for some workarounds.

Just to clear up some things:

There's no need for a 32-bit encoding to handle more than ~65000 Unicode characters. The most common encodings used for Unicode, UTF-8 and UTF-16, are variable-width (multibyte) encodings (http://en.wikipedia.org/wiki/Variable-width_encoding), meaning that a variable number of bytes is used to represent each character. UTF-16 uses either 2 or 4 bytes per character. The Mathematica kernel interprets every 2-byte unit as a single character in a string, which occasionally produces invalid characters (when it encounters a 4-byte sequence). This may be considered a bug. The front end is quite moody about how it handles 4-byte sequences, which is definitely a bug.
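To make the 4-byte case concrete, here is a minimal sketch of how UTF-16 represents a code point above U+FFFF as two 16-bit units (a surrogate pair). The name toSurrogatePair is a hypothetical helper introduced only for illustration; it is not a Mathematica built-in.

```mathematica
(* Hypothetical helper, not a built-in: split a code point above U+FFFF
   into its UTF-16 high and low surrogate code units. *)
toSurrogatePair[cp_Integer] /; 16^^10000 <= cp <= 16^^10ffff :=
 With[{v = cp - 16^^10000},
  (* upper 10 bits go into the high surrogate, lower 10 bits into the low one *)
  {16^^d800 + Quotient[v, 2^10], 16^^dc00 + Mod[v, 2^10]}]

(* U+1F618, the character from the question *)
toSurrogatePair[16^^1f618]
(* {55357, 56856}, i.e. {16^^d83d, 16^^de18} *)
```

These two units are what the kernel ends up treating as two separate characters.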

Limited workaround

When working strictly in the kernel (e.g. reading the Unicode data from a file), I sometimes use this function as a workaround to get the actual Unicode code point of 2-unit (4-byte) UTF-16 sequences:

toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff := (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4

You can use

Split[ToCharacterCode[str], If[16^^d800 <= # <= 16^^dbff, True] &]

to split a UTF-16 string into Unicode characters correctly (either length-one or length-two, depending on the character).
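Put together, the two pieces could be used roughly as follows. This is only a sketch: the single-element rule for toCodePoint and the wrapper name codePoints are assumptions added for illustration, and the input is given directly as a list of UTF-16 code units so the example does not depend on how a particular version builds strings.

```mathematica
(* Definition from above, plus an assumed fallback rule for code units
   that are not part of a surrogate pair. *)
toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff :=
 (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4
toCodePoint[{a_}] := a

(* Hypothetical wrapper: list of UTF-16 code units in, code points out.
   Assumes well-formed input, i.e. every high surrogate is followed by
   exactly one low surrogate. *)
codePoints[units_List] :=
 toCodePoint /@ Split[units, If[16^^d800 <= # <= 16^^dbff, True] &]

codePoints[{16^^6c34, 16^^d83d, 16^^de18}]
(* {27700, 128536}, i.e. U+6C34 and U+1F618 *)
```

For a string, ToCharacterCode[str] would supply the list of units, as in the one-liner above.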

This is an ugly and inconvenient workaround, and it won't allow you to display any of these characters in the front end unless you come up with some hack for that as well, e.g. importing the glyph reference images from unicode.org (at least for CJK they have them).

See also

See my earlier question on the same topic:
Reading an UTF-8 encoded text file in Mathematica

If you are going to work with Chinese, you may come across this other problem too:
Getting the Mathematica front end to obey the FontFamily option

星星的軌跡 2024-12-21 12:35:00

According to this page in the Mathematica 8 help:

Mathematica supports both 8- and 16-bit raw character encodings.

Presumably they are saying that they don't support 32-bit encodings, which would be needed to support your desired character.

As further evidence (in the absence of a clear statement in the documentation), the list of supported encodings on the same page has no 32-bit encodings. 32-bit encodings are apparently only supported in MathLink. I suppose there hasn't been enough user demand.
