Perl: How to encode and decode using only uppercase letters
I've been dealing with this for quite some time now. I have characters from, let's say, the Latin alphabet and want them encoded as strings of uppercase letters only. Is there a module that can do this? Or a BaseX encoding that I can modify to use only uppercase alphabetic characters?
I have currently implemented parts of it using regex substitutions, but that only covers a few characters and is definitely not efficient :)
Anyway, if there is no way to handle this with a module or function, is there a way to do it efficiently with a regex?
I thought about something like tr/[\+,\-,...]/[PLUS,MINUS,...]/cds;
but it seems tr only substitutes character by character, not a character by a sequence of characters :(
Any ideas?
achim
To answer the tr question:
Base 26 is possible to do, but it's a bit hard and inefficient to implement since 26 is not a power of 2. But it's definitely what you want. I'll see about coding it up.
In the meantime, here's a base 16 solution:
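A minimal sketch of one way to do base 16 with letters only, assuming the input is a byte string. The function names encode_base16_alpha and decode_base16_alpha are made up for this illustration. It hex-encodes with unpack and then uses tr to remap the 16 hex digits to A-P, which works because this really is a character-by-character mapping.

```perl
use strict;
use warnings;

# Encode a byte string as uppercase letters A-P (two letters per byte).
sub encode_base16_alpha {
    my ($bytes) = @_;
    my $hex = unpack 'H*', $bytes;            # lowercase hex, high nibble first
    (my $letters = $hex) =~ tr/0-9a-f/A-P/;   # 0-9 become A-J, a-f become K-P
    return $letters;
}

# Reverse the mapping: letters back to hex digits, hex digits back to bytes.
sub decode_base16_alpha {
    my ($letters) = @_;
    (my $hex = $letters) =~ tr/A-P/0-9a-f/;
    return pack 'H*', $hex;
}

print encode_base16_alpha("Hi"), "\n";   # prints EIGJ
```

The output is always exactly twice as long as the input, and only 16 of the 26 letters are ever used.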
Let's see how efficient base 26 is compared to base 16:
An efficient implementation would produce slightly less efficient output.
Note that the efficient implementation uses an extra digit for inputs that are 7 bytes long.
So is it worth the effort of using base26 over base16? Probably not, unless each byte is really precious.
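To put rough numbers on that trade-off, here is a small sketch that computes the theoretical minimum number of base-26 letters for a given input length and prints it next to the base-16 length. The helper name base26_letters_needed is hypothetical; it uses Math::BigInt so the comparison stays exact for longer inputs. A particular implementation may need more letters than this minimum.

```perl
use strict;
use warnings;
use Math::BigInt;

# Smallest k such that 26**k >= 256**$n_bytes, i.e. the fewest letters
# that can distinguish every possible input of $n_bytes bytes.
sub base26_letters_needed {
    my ($n_bytes) = @_;
    my $limit    = Math::BigInt->new(256)->bpow($n_bytes);
    my $capacity = Math::BigInt->new(1);
    my $letters  = 0;
    while ($capacity < $limit) {
        $capacity->bmul(26);
        $letters++;
    }
    return $letters;
}

for my $n (1, 4, 7, 16) {
    printf "%2d bytes: base16 = %2d letters, base26 minimum = %2d letters\n",
        $n, 2 * $n, base26_letters_needed($n);
}
```

For 7 bytes, for instance, the minimum is 12 letters against 14 for base 16, so the saving is real but modest.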
And finally, here's a base 26 implementation.
The simplest approach is to use base16 encoding, as others have suggested, and remap the digits to letters -- but then you're only using 16 out of 26 characters, which is wasteful.
The most efficient possible encoding would be base26, but that would be very difficult -- in effect you'd be treating the entire input as a large binary number and converting it from base 2 to base 26.
log2(26) is just over 4.7, so at best (in the absence of compression) you can encode 4.7 bits per letter. A less wasteful encoding might encode 4 bytes (32 bits) in 7 letters. 7 letters gives you about 32.9 bits of information, so you're not losing as much information. And it can all be done in 32-bit arithmetic. Then you'll have to decide what to do if the input isn't a multiple of 4 bytes.
(The actual implementation is left as an exercise -- at least for now.)
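One possible way to do that exercise, as a sketch rather than a definitive implementation. The names encode_blocks and decode_blocks are hypothetical, and the rule for a trailing partial block is an assumption of this sketch: 1, 2, or 3 leftover bytes become 2, 4, or 6 letters, so the decoder can tell from the length of the last chunk how many bytes it holds. The input is assumed to be a byte string.

```perl
use strict;
use warnings;

my @LETTERS_FOR = (0, 2, 4, 6, 7);                    # letters for 0..4 bytes
my %BYTES_FOR   = (2 => 1, 4 => 2, 6 => 3, 7 => 4);   # the reverse lookup

sub encode_blocks {
    my ($bytes) = @_;
    my $out = '';
    for (my $pos = 0; $pos < length $bytes; $pos += 4) {
        my $chunk = substr $bytes, $pos, 4;
        my $value = 0;
        $value = $value * 256 + ord($_) for split //, $chunk;   # always below 2**32
        my $letters = '';
        for (1 .. $LETTERS_FOR[length $chunk]) {
            $letters = chr(65 + $value % 26) . $letters;
            $value   = int($value / 26);
        }
        $out .= $letters;
    }
    return $out;
}

sub decode_blocks {
    my ($text) = @_;
    my $out = '';
    for (my $pos = 0; $pos < length $text; $pos += 7) {
        my $chunk   = substr $text, $pos, 7;
        my $n_bytes = $BYTES_FOR{length $chunk}
            or die "invalid encoded length\n";
        my $value = 0;
        $value = $value * 26 + (ord($_) - 65) for split //, $chunk;
        my $bytes = '';
        for (1 .. $n_bytes) {
            $bytes = chr($value % 256) . $bytes;
            $value = int($value / 256);
        }
        $out .= $bytes;
    }
    return $out;
}
```

With this scheme a 13-byte input such as "Hello, World!" becomes 23 letters (three full blocks of 7 plus 2 letters for the last byte), against 26 letters in base 16.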
You can use Base32 encoding, with 26 uppercase letters and 6 digits:
http://pastebin.com/YPvfrpHW
Just change the $code array to whatever charset you want to use.
Edit: Whoops, just noticed you're using Perl and not PHP, sorry. You should be able to find a Base32 module on CPAN that does the same thing.
Edit 2: FWIW, I see Convert::Base32, Encode::Base32, and MIME::Base32 on CPAN.
For a bit of fun, here's my Enigma simulator. There isn't an easy way to achieve what you want to do as the wheels don't have any escape chars, and any sequences you invent to represent an escape sequence will significantly reduce the strength of the cipher.
However, 8-bit Latin input could be mapped from 0-255 to [A-P][A-P] using chr(65+($Char&15)).chr(65+($Char>>4)), and reversed on output, but Q-Z would be wasted and there would be a lot of holes in the input, though this could be solved by running it through gzip first.
The Germans usually used X to represent spaces, and spelled out punctuation if really necessary, trying to avoid spelling the same thing the same way twice.
I know it is annoying, but that is the way it is. If we increase the number of letters on the wheels, then it ceases to be an Enigma machine!
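The nibble mapping mentioned above can be sketched as follows, assuming byte-string input. The function names are made up for this example; the letter order (low nibble first, then high nibble) follows the expression in the answer.

```perl
use strict;
use warnings;

# Each byte becomes two letters in A-P: low nibble first, then high nibble.
sub nibbles_to_letters {
    my ($bytes) = @_;
    my $out = '';
    for my $char (unpack 'C*', $bytes) {
        $out .= chr(65 + ($char & 15)) . chr(65 + ($char >> 4));
    }
    return $out;
}

# Reverse: take the letters pairwise and rebuild each byte.
sub letters_to_nibbles {
    my ($letters) = @_;
    my $out = '';
    while ($letters =~ /([A-P])([A-P])/g) {
        $out .= chr((ord($1) - 65) | ((ord($2) - 65) << 4));
    }
    return $out;
}

print nibbles_to_letters("A"), "\n";   # prints BE (0x41: low nibble 1, high nibble 4)
```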
This solution was already briefly mentioned by Keith Thompson and jrockway.
Here we look into it and implement it.
The problem is very simple if you know that:
Think of the file's bytes as the digits of a number in base 2^8 = 256.
Usually we use 0, 1, 2, … as digits, but it's possible to use A, B, C, … or even other symbols instead.
Therefore, an approach to encode your (text) file using only A-Z is: read the bytes as one number in base 256, then write that number in base 26 with the digits A-Z.
Here's an implementation:
This prints the encoding
ESQEKWWQLSBQHVKBCAQYKLXMVQRUFOOMPJGFTADLYTDQLFGTRTLWJBYTJICKUOFUVPHSHZHCRZKFMVSHRHCACZFUWTXVXUDRVKMIAIKK
which is then decoded correctly again.
Benefits:
Drawbacks:
However, in practice this probably is acceptable as I suspect you only deal with rather short strings. On my system (i5-4570, 3.2 GHz), the very un-optimized implementation from above instantly encoded and decoded 1 kB. Without GMP 10 kB took 11 seconds. With GMP 100 kB took 10 seconds.
Implementation Notes:
utf8::encode/decode on the input and our $...Digits.
from_base and to_base could probably be implemented more efficiently because the default implementation might not know that the digits $...Digits are contiguous.
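For the big-number approach described above (treat the bytes as one base-256 number and rewrite it in base 26), here is a minimal sketch using only basic Math::BigInt arithmetic. It is not the listing this answer refers to and will not reproduce the output quoted there. The function names are hypothetical, and the leading sentinel digit is an assumption of this sketch, added so that leading zero bytes survive the round trip. The input is assumed to be a byte string, for example after utf8::encode.

```perl
use strict;
use warnings;
use Math::BigInt;

sub encode_base26 {
    my ($bytes) = @_;
    my $n = Math::BigInt->new(1);            # sentinel digit keeps leading \0 bytes
    $n->bmul(256)->badd($_) for unpack 'C*', $bytes;
    my $out = '';
    while (!$n->is_zero) {
        my ($quo, $rem) = $n->copy->bdiv(26);    # list context: quotient, remainder
        $out = chr(65 + $rem->numify) . $out;
        $n   = $quo;
    }
    return $out;
}

sub decode_base26 {
    my ($text) = @_;
    my $n = Math::BigInt->new(0);
    $n->bmul(26)->badd(ord($_) - 65) for split //, $text;
    my @bytes;
    while ($n > 1) {                         # stop once only the sentinel is left
        my ($quo, $rem) = $n->copy->bdiv(256);
        unshift @bytes, $rem->numify;
        $n = $quo;
    }
    return pack 'C*', @bytes;
}
```

Because every step multiplies or divides the whole number, the running time grows roughly quadratically with the input length, which fits the timings reported above for larger inputs.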
Hex (base16) encoding generates output with only 16 possible characters. Because the alphabet has 26 letters, you could swap the digits for letters. You'd then use just 16 letters of the alphabet, but you'd have a string consisting only of alphabetic letters that is really easy to encode, decode, and turn back into the original string. It's a strange question (and it looks like a homework assignment) but it will do the trick.
You've specced a very lossy translation... which may not be satisfactory.
But:
Note that the submarine captain is going to get useless directions if the message specifies the refueling location as "73N 39W"...
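A sketch of the kind of character-to-word substitution the question was reaching for with tr, done instead with s///ge and a lookup hash. The %WORD table and the spell_out name are made up for this illustration, and the choice to silently drop unmapped characters is an assumption that makes the lossiness easy to see.

```perl
use strict;
use warnings;

# Hypothetical spelling table; anything not listed here is dropped.
my %WORD = (
    '+' => 'PLUS',
    '-' => 'MINUS',
    '.' => 'STOP',
    ' ' => 'X',
);

sub spell_out {
    my ($text) = @_;
    $text = uc $text;
    # /e evaluates the replacement as code, so one character can become a whole word.
    $text =~ s/([^A-Z])/exists $WORD{$1} ? $WORD{$1} : ''/ge;
    return $text;
}

print spell_out("a+b"), "\n";       # prints APLUSB
print spell_out("73N 39W"), "\n";   # prints NXW
```

The second call shows the problem: the digits are gone, and even for mapped characters a decoder cannot tell a spelled-out PLUS from the literal letters P, L, U, S.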