Representing multiple values with one character in Python
I have 2 values that are in the range 0-31. I want to be able to represent both of these values in 1 character (for example in base 64 to explain what I mean by 1 character) but still be able to know what both of the values are and which came first.
Comments (2)
Find a nice Unicode block that has 1024 contiguous codepoints, for example CJK Unified Ideographs, and map your 32*32 values onto them. In Python 3:
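The code listing that belongs here is missing from this copy of the answer. What follows is a minimal sketch reconstructed from the explanation further down, not the answerer's exact code. Only the name `char_decode` appears in the text; `char_encode`, the constants `BASE` and `RADIX`, and the range checks are assumptions added for illustration.

```python
BASE = 0x4E00  # first codepoint of the CJK Unified Ideographs block (19968)
RADIX = 32     # each input value lies in range(32)


def char_encode(a, b):
    """Pack two values from range(32) into a single character (hypothetical name)."""
    if not (0 <= a < RADIX and 0 <= b < RADIX):
        raise ValueError("both values must be in range(32)")
    # a * 32 + b is a number in range(1024); shift it into the CJK block.
    return chr(BASE + a * RADIX + b)


def char_decode(c):
    """Recover the (a, b) pair from a character produced by char_encode."""
    offset = ord(c) - BASE
    if not 0 <= offset < RADIX * RADIX:
        raise ValueError("character is outside the 1024-codepoint window")
    # divmod undoes a * 32 + b, so the order of the two values is preserved.
    return divmod(offset, RADIX)


print(char_encode(17, 3))               # 倣
print(char_decode(char_encode(17, 3)))  # (17, 3)
```

Because `a` occupies the "high" part of the number, the decoder always returns the values in the order they were passed in.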
As you mention Base64... this is impossible. Each character in a Base64 encoding only allows for 6 bits of data, and you need 10 to represent your two numbers.
And also note that while this is only one character, it takes up two or three bytes, depending on the encoding you use. As noted by others, there is no way to stuff 10 bits of data into an 8-bit byte.
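The size remark can be checked directly. This is a small sketch using the example character from the explanation (U+5023); the variable names are illustrative, and `utf-16-le` is chosen here only so that the byte-order mark does not inflate the count.

```python
c = chr(0x4E00 + 17 * 32 + 3)  # the example character, U+5023

utf8 = c.encode("utf-8")
utf16 = c.encode("utf-16-le")  # the "-le" variant omits the 2-byte BOM

print(len(c))      # 1 (one character)
print(len(utf8))   # 3 (bytes in UTF-8)
print(len(utf16))  # 2 (bytes in UTF-16)
```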
Explanation:
`a * 32 + b` simply maps two numbers in range [0, 32) into a single number in range [0, 1024). For example, `0 * 32 + 0 = 0`; `31 * 32 + 31 = 1023`. `chr` finds the Unicode character with that codepoint, but characters with low codepoints like `0` are not printable, and would be a poor choice, so the result is shifted to the beginning of a nice big Unicode block: `0x4E00` is a hexadecimal representation of `19968`, and is the codepoint of the first character in the CJK Unified Ideographs block. Using the example values, `17 * 32 + 3 = 547` and `19968 + 547 = 20515`, or `0x5023` in hexadecimal, which is the codepoint of the character `倣`. Thus, `chr(20515) = "倣"`.

The `char_decode` function just does all of these operations in reverse: if `a * p + b = x`, then `a, b = divmod(x, p)` (see `divmod`). If `c = chr(x)`, then `x = ord(c)` (see `ord`). And I am sure you know that if `w + r = y`, then `r = y - w`. So in the example, `ord("倣") = 20515`; `20515 - 0x4E00 = 547`; and `divmod(547, 32)` is `(17, 3)`.
Values [0, 31] can be stored in 5 bits, since `2**5 == 32`. You can therefore unambiguously store two such values in 10 bits. Conversely, you will not be able to unambiguously retrieve two 5-bit values from fewer than 10 bits unless some other conditions hold true.

If you are using an encoding that allows 1024 or more distinct characters, you can map your pairs to characters. Otherwise you simply can't. So ASCII is not going to work here, and neither is Latin1. But pretty much any of the "normal" Unicode encodings are fine.

Keep in mind that for something like UTF-8, the actual character will take up more than 10 bits. If that's a concern, consider using UTF-16 or so.
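The bit-counting argument can be sketched in a few lines. The helper names `pack` and `unpack` are hypothetical and are not from the answer; they only illustrate that two 5-bit values fit in exactly 10 bits, and that 1024 pairs exceed what ASCII (128 characters) or Latin-1 (256 characters) can distinguish.

```python
def pack(a, b):
    """Combine two 5-bit values into one integer; a goes in the high 5 bits."""
    return (a << 5) | b


def unpack(n):
    """Split a 10-bit integer back into its high and low 5-bit halves."""
    return n >> 5, n & 0b11111


n = pack(17, 3)
print(n)                        # 547
print(unpack(n))                # (17, 3)
print(pack(31, 31).bit_length())  # 10

# Number of distinct pairs versus characters available in small encodings.
pairs = 32 * 32
for name, size in (("ASCII", 128), ("Latin-1", 256)):
    print(name, size >= pairs)  # False for both
```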