Efficient way to ASCII-encode UTF-8

Posted on 2024-08-27 03:34:53

I'm looking for a simple and efficient way to store UTF-8 strings in 7-bit ASCII. By efficient I mean the following:

  • all ASCII alphanumeric chars in the input should stay the same ASCII alphanumeric chars in the output
  • the resulting string should be as short as possible
  • the operation needs to be reversible without any data loss
  • the resulting ASCII string should be case insensitive
  • there should be no restriction on the input length
  • the whole UTF-8 range should be allowed

My first idea was to use Punycode (IDNA) as it fits the first four requirements, but it fails at the last two.

Can anyone recommend an alternative encoding scheme? Even better if there's some code available to look at.

Comments (6)

甜宝宝 2024-09-03 03:34:54

Since ASCII covers the full range of 7-bit values, an encoding scheme that preserves all ASCII characters, uses 7-bit code units, and encodes the full Unicode range is not possible.

Edited to add:

I think I understand your requirements now. You are looking for a way to encode UTF-8 strings in a seven-bit code, in which, if that encoded string were interpreted as ASCII text, then the case of the alphabetic characters may be arbitrarily modified, and yet the decoded string will be byte-for-byte identical to the original.

If that's the case, then your best bet would probably be just to encode the binary representation of the original as a string of hexadecimal digits. I know you are looking for a more compact representation, but that's a pretty tall order given the other constraints of the system, unless some custom encoding is devised.

Since the hexadecimal representation can encode arbitrary binary values, it might be possible to shrink the result by compressing the original bytes before converting them to hex.
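To make the hex idea concrete, here is a minimal Python sketch. The function names `encode_hex` and `decode_hex` and the `compress` flag are made up for illustration, and `zlib` is just one possible compressor, not something the answer prescribes. Note that this approach does not keep ASCII alphanumerics readable; it only gives a case-insensitive, reversible 7-bit form.

```python
import zlib


def encode_hex(text: str, compress: bool = False) -> str:
    """Encode text as hex digits of its UTF-8 bytes (optionally zlib-compressed)."""
    raw = text.encode("utf-8")
    if compress:
        raw = zlib.compress(raw, 9)
    return raw.hex()


def decode_hex(encoded: str, compressed: bool = False) -> str:
    """Reverse encode_hex; upper- and lowercase hex digits are both accepted."""
    raw = bytes.fromhex(encoded)
    if compressed:
        raw = zlib.decompress(raw)
    return raw.decode("utf-8")


sample = "Grüße, 日本語!"
# Changing the case of the encoded string does not affect decoding.
assert decode_hex(encode_hex(sample).upper()) == sample
assert decode_hex(encode_hex(sample, compress=True).upper(), compressed=True) == sample
```

Compression only tends to pay off for longer inputs; for short strings the zlib header can make the result larger than plain hex.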

找回味觉 2024-09-03 03:34:54

If you're talking about non-standard schemes - MECE

你的背包 2024-09-03 03:34:54

URL encoding or numeric character references are two possible options.
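Both options can be tried with the Python standard library. This is only a sketch: the helper names `url_encode`, `url_decode`, `ncr_encode` and `ncr_decode` are hypothetical, and the choice of `safe=""` is an assumption rather than part of the suggestion above.

```python
import html
from urllib.parse import quote, unquote


def url_encode(text: str) -> str:
    # Percent-encode the UTF-8 bytes; ASCII letters and digits pass through unchanged.
    return quote(text, safe="")


def url_decode(encoded: str) -> str:
    return unquote(encoded)


def ncr_encode(text: str) -> str:
    # Escape '&', '<' and '>' first so literal text is not mistaken for a reference,
    # then turn every non-ASCII character into a decimal reference like &#26085;.
    escaped = html.escape(text, quote=False)
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


def ncr_decode(encoded: str) -> str:
    # Caveat: html.unescape follows HTML5 parsing rules, so references to
    # U+0080..U+009F are remapped and will not round-trip.
    return html.unescape(encoded)


sample = "naïve café & 日本語"
assert url_decode(url_encode(sample)) == sample
assert ncr_decode(ncr_encode(sample)) == sample
```

Neither scheme makes the result case-insensitive for the letters it leaves untouched; only the hex digits in percent escapes can change case safely.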

风柔一江水 2024-09-03 03:34:54

It depends on the distribution of characters in your strings.

Quoted-printable is good for mostly-ASCII strings because there's no overhead except with '=' and control characters. However, non-ASCII characters take an inefficient 6-12 bytes each, so if you have a lot of those, you'll want to consider UTF-7 or Base64 instead.
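A quick way to check this trade-off on your own data is to measure the encoded sizes with Python's `quopri` and `base64` modules. The helper name `encoded_sizes` is made up for this sketch. Keep in mind that Base64 output is case-sensitive, so it would not satisfy the case-insensitivity requirement from the question.

```python
import base64
import quopri


def encoded_sizes(text: str) -> dict:
    """Return the byte length of text under each encoding, for comparison."""
    raw = text.encode("utf-8")
    return {
        "utf-8": len(raw),
        "quoted-printable": len(quopri.encodestring(raw)),
        "base64": len(base64.b64encode(raw)),
    }


for sample in ("hello world", "日本語日本語"):
    raw = sample.encode("utf-8")
    # Both encodings are reversible.
    assert quopri.decodestring(quopri.encodestring(raw)) == raw
    assert base64.b64decode(base64.b64encode(raw)) == raw
    print(sample, encoded_sizes(sample))
```

For mostly-ASCII input quoted-printable should come out shorter, while for input dominated by non-ASCII characters Base64 should win.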

妥活 2024-09-03 03:34:54

Punycode is used for IDNA, but you can use it outside the restrictions that IDNA imposes.

In itself, Punycode doesn't fail your last two requirements:

>>> import sys
>>> _ = ("\U0010FFFF"*10000).encode("punycode")
>>> all(chr(c).encode("punycode") for c in range(sys.maxunicode))
True

(For IDNA, Python supplies a separate codec under that name, "idna".)

Obviously, if you don't nameprep the input, the encoded string isn't strictly case-insensitive anymore. But if you supply only lowercase input (or if you don't care about the case of the decoded string), you should be good to go.
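For completeness, a small round-trip sketch using the same built-in "punycode" codec. The wrapper names `puny_encode` and `puny_decode` are hypothetical and only there to keep the example readable.

```python
def puny_encode(text: str) -> str:
    """Encode text with Python's built-in punycode codec; the result is pure ASCII."""
    return text.encode("punycode").decode("ascii")


def puny_decode(encoded: str) -> str:
    """Reverse puny_encode."""
    return encoded.encode("ascii").decode("punycode")


for sample in ("bücher", "日本語", "\U0010FFFF" * 100):
    encoded = puny_encode(sample)
    assert all(ord(c) < 128 for c in encoded)  # 7-bit clean
    assert puny_decode(encoded) == sample      # lossless round trip
```

The ASCII characters of the input are copied to the front of the output as they are, which is what covers the first requirement in the question.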

苏辞 2024-09-03 03:34:53

UTF-7, or, slightly less transparent but more widespread, quoted-printable.

all ASCII chars in the input should stay ASCII chars in the output

(Obviously not fully possible as you need at least one character to act as an escape.)
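Python ships a "utf-7" codec, so the UTF-7 option and the escape-character point can be checked with a short sketch. The helper names `to_utf7` and `from_utf7` are made up for illustration. UTF-7 uses a Base64 variant for non-ASCII runs, so the result is not case-insensitive.

```python
def to_utf7(text: str) -> str:
    """Encode text as UTF-7; the result contains only ASCII characters."""
    return text.encode("utf-7").decode("ascii")


def from_utf7(encoded: str) -> str:
    """Reverse to_utf7."""
    return encoded.encode("ascii").decode("utf-7")


sample = "Grüße + 日本語"
encoded = to_utf7(sample)
assert all(ord(c) < 128 for c in encoded)
assert from_utf7(encoded) == sample

# '+' is the escape character in UTF-7, so a literal '+' cannot stay as it is.
assert to_utf7("a+b") != "a+b"
assert from_utf7(to_utf7("a+b")) == "a+b"
```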
