Is there a good way to "encode" binary data as plausible-looking made-up words, and back again?
To give you a very simple and bad example: the data is split into 4-bit groups. The 16 possible values correspond to the first 16 consonants. You add a random vowel to make each syllable pronounceable, so "08F734F7" can become "ba lo ta ku fo go ta ka". You can join some syllables, add punctuation and capitalization, and it can become "Balo ta kufogo, Taka?", which looks like a plausible language.
Just to make it clear, I'm not trying to protect the binary data.
I want to use this after I compress and encrypt my (UTF-8) plain text diary. The resulting binary data should look pretty random. I need to convert this data into a plausible looking language and be able to revert it back. I'm going to print the "language" on paper and make a custom book.
So what I'm looking for is the best method of converting random data to readable, plausible words. By "good" I mean the highest bits-to-letters ratio (while still making it look like a real language). In my example it's exactly 2 bits per letter, or 4 letters per byte.
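The simple scheme described above can be sketched in a few lines. This is just a baseline for comparison, assuming C++, lowercase text, and taking the first 16 consonants to be b through t (function names are mine):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Each hex digit picks one of the first 16 consonants; a random vowel
// (carrying no data) is appended to make the syllable pronounceable.
const std::string CONSONANTS = "bcdfghjklmnpqrst";  // first 16 consonants
const std::string VOWELS = "aeiou";

std::string encode(const std::string& hex) {
    std::string out;
    for (char c : hex) {
        int v = c <= '9' ? c - '0' : (c | 0x20) - 'a' + 10;
        out += CONSONANTS[v];
        out += VOWELS[std::rand() % VOWELS.size()];  // vowel is pure decoration
    }
    return out;
}

std::string decode(const std::string& text) {
    std::string hex;
    for (char c : text) {
        std::size_t i = CONSONANTS.find(c);
        if (i != std::string::npos)   // vowels, spaces and punctuation are skipped
            hex += "0123456789ABCDEF"[i];
    }
    return hex;
}
```

Because the vowels carry no information, the decoder simply ignores everything that is not one of the 16 data consonants, so spacing and punctuation can be inserted freely.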
FASCINATING question!
My best solution so far encodes 12 bits in 2 to 4 characters, giving 3 to 6 bits per letter. (Friday is not a good day to do the necessary maths on the uneven distribution of word lengths, so I haven't worked out the average bits per letter).
The idea is to work with "phonemes" that start with one or two consonants and end with one or two vowels. There are 21 consonants, and I feel that each one could be followed by an h, l, r, w or y and still look reasonable. So your phoneme starts with one of 126 consonant parts - b, bh, bl, br, bw, by, c, ch, cl, cr, ..., z, zh, zl, zr, zw, zy (admittedly, things like yy and zl look a bit weird, but it is a foreign language after all :) )
126 is so tantalisingly close to 128 that we could add t' and b' (for example) for the last two values - giving us a dictionary of 128 values, storing 7 bits. You could even replace yy with d' and zl with p' or whatever.
Similarly, the vowel portion could be a single vowel or a pair of vowels. I have eliminated aa, ii and uu because they look too weird to me (personal preference) even though they do occur in some real words (who decided "continuum" should be spelt that way anyway!). So that gives 27 possible vowel parts: a, e, i, o, u, ae, ai, ao, ..., ue, ui, uo.
27 is close to 32, so throw in 5 more values using accented vowels (é, â, etc.). That gives us 5 bits, with the added benefit of some sparse accenting.
So that's 12 bits in 2, 3 or 4 letters.
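Assuming one concrete choice for the five accented vowels (which the answer leaves open), the two tables can be generated and applied like this — a sketch, not the answerer's code:

```cpp
#include <cassert>
#include <string>
#include <vector>

// 21 consonants, each optionally followed by h/l/r/w/y (21*6 = 126),
// plus t' and b' for the last two slots: 128 consonant parts = 7 bits.
std::vector<std::string> consonantParts() {
    std::vector<std::string> parts;
    for (char c : std::string("bcdfghjklmnpqrstvwxyz")) {
        parts.push_back(std::string(1, c));
        for (char m : std::string("hlrwy")) parts.push_back({c, m});
    }
    parts.push_back("t'");
    parts.push_back("b'");
    return parts;  // 128 entries
}

// Single vowels plus pairs (minus aa/ii/uu), plus 5 accented vowels:
// 5 + 22 + 5 = 32 vowel parts = 5 bits. The accented five are my choice.
std::vector<std::string> vowelParts() {
    const std::string v = "aeiou";
    std::vector<std::string> parts;
    for (char a : v) parts.push_back(std::string(1, a));
    for (char a : v)
        for (char b : v) {
            std::string p{a, b};
            if (p != "aa" && p != "ii" && p != "uu") parts.push_back(p);
        }
    for (const char* acc : {"é", "â", "î", "ô", "û"}) parts.push_back(acc);
    return parts;  // 32 entries
}

// 12 bits -> one phoneme: high 7 bits pick the consonant part, low 5 the vowel.
std::string phoneme(unsigned v) {
    static const std::vector<std::string> cons = consonantParts();
    static const std::vector<std::string> vows = vowelParts();
    return cons[(v >> 5) & 127] + vows[v & 31];
}
```

So each 12-bit value becomes a 2-to-4-letter phoneme, exactly as counted above.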
For more fun, if the next bit is a 1, insert a space 90% of the time (at random), or a punctuation mark the other 10% - but if the next bit is a 0, don't insert anything - just start the next phoneme. Capitalise the first letter after punctuation.
That should give you something like:
Bwaijou t'ei plo ku bhaproti! Llanoi proimlaroo jaévli.
Maybe someone can take it further.
Summary
This solution will let you use any of a large number of languages including pronounceable nonsense with a customizable efficiency. You can even create something that looks like grammatically correct but meaningless English or French (or worse, something that shifts between the two like a drunken polyglot). The basic idea is to use your data to continually select paths from a context free grammar until you run out of data.
Details
Add a string to the end of your input that doesn't occur anywhere inside of it ("This is the end of my input, thank you very much" would be very unlikely to occur in a string of encrypted text, for example.) You can do this without the string but it makes it easier.
Treat your input as one very long integer, encoded low-bit first. Obviously your machine won't be able to process such a big integer all at once; every time the high byte becomes zero, just strip the next byte's worth of values off your file and multiply them in.
Create your language as a context free grammar. To avoid forgetting what the encoding is, you can print it at the end of your book. Avoid ambiguity. If your grammar is ambiguous, you won't be able to decode. This is not hard, essentially don't use the same terminal in two places, ensure that the concatenation of two terminals cannot make another terminal, and ensure that reading the output you can tell where you put the formatting symbols.
Now, to take an integer and turn it into language, use the following pseudo-code that uses n to choose which production to take.
To decode, you use a standard parser generator like GNU bison, which should also help you avoid creating an ambiguous grammar.
Run the parser on the input. Start n at 0. From the syntax tree the parser generates, you can read off the production number used at each step. Multiply n by the number of productions and add the production number to get the value of n after that particular input. As you fill up the low bytes of your machine word, shift them off into your decoded file. When you reach your unique end-of-text phrase, stop decoding.
Example 1
This will be clearer with an example or three.
My example simple language is as follows (non-terminals are capitalized). Note that because the terminals are large compared with the depth of the tree, it is not very efficient, but I think that having more terminals, or making them shorter, can give you any efficiency you want (up to the bits wasted per character by using n bits per character).
You could just as easily do this with syllables as an expansion of verbs and nouns. You could also include noun-phrases and verb phrases to have adjectives etc. in your language. You will probably also want paragraph and chapter symbols that break down into appropriate subunits with formatting. The number of alternate choices at a certain level of the tree determines the average number of bits encoded by each symbol. __capital is an example of a formatting symbol that, on output, capitalizes the next word.
So, imagine that my input becomes the number 77. Then I would encode it as follows:
S goes to two things. 77 % 2 = 1, and the quotient is 77 / 2 = 38.
Now our number is 38 and we are expanding __capital, Noun, T-Verb, Noun, Punct
First word is __capital which is a terminal symbol. Output __capital (setting the print routine to capitalize the next word).
Now expanding Noun. Noun has 6 options. 38 % 6 = 2, and 38 / 6 = 6. We choose "spot".
Now expanding spot,T-verb,Noun,Punct. Spot is terminal. Output spot. The printer being in capital mode writes "Spot" to the output file.
Now expanding T-Verb. Our number is 6. T-verb has 4 options. 6 % 4 = 2. 6 / 4 = 1. So we choose "grows". In the next step we output grows to our file since it is a terminal.
Now expanding Noun, Punct. Noun has 6 options. Our number is 1. 1 % 6 = 1, and 1 / 6 = 0. So we choose "sally", which is output in the next step.
Finally we are expanding Punct which has 3 options. Our number is 0 (and will stay that way forever - this is why you append an end-of-text string to the end of your input, otherwise your decoding would end with an infinite string of zeroes.) We choose ".", which is output.
Now the current string to expand is empty so we set it back to the root "S". But since n is 0, the algorithm terminates.
Thus 77 has become "Spot grows sally."
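The grammar listing and pseudo-code did not survive the formatting. The sketch below reconstructs them from the choices the walkthrough pins down: S has 2 productions, Noun has 6 ("sally" at index 1, "spot" at index 2), T-Verb has 4 ("grows" at index 2), and Punct has 3 ("." at index 0). Every other entry, including the I-Verb alternative for S, is a hypothetical filler of mine:

```cpp
#include <cassert>
#include <cctype>
#include <deque>
#include <map>
#include <string>
#include <vector>

using Productions = std::vector<std::vector<std::string>>;

// Symbols absent from this map are terminals. Only the entries named in the
// walkthrough are fixed; the rest are placeholders.
const std::map<std::string, Productions> GRAMMAR = {
    {"S",      {{"__capital", "Noun", "I-Verb", "Punct"},            // filler
                {"__capital", "Noun", "T-Verb", "Noun", "Punct"}}},
    {"Noun",   {{"rover"}, {"sally"}, {"spot"}, {"rex"}, {"lady"}, {"fido"}}},
    {"T-Verb", {{"sees"}, {"chases"}, {"grows"}, {"likes"}}},
    {"I-Verb", {{"runs"}, {"sleeps"}}},
    {"Punct",  {{"."}, {"!"}, {"?"}}},
};

std::string encode(unsigned long long n) {
    std::string out;
    bool capitalize = false;
    std::deque<std::string> work = {"S"};
    while (!work.empty()) {
        std::string sym = work.front();
        work.pop_front();
        auto it = GRAMMAR.find(sym);
        if (it != GRAMMAR.end()) {
            // Non-terminal: n % (number of productions) picks the expansion.
            const Productions& prods = it->second;
            std::size_t pick = n % prods.size();
            n /= prods.size();
            work.insert(work.begin(), prods[pick].begin(), prods[pick].end());
        } else if (sym == "__capital") {
            capitalize = true;  // formatting symbol: capitalize the next word
        } else {
            std::string word = sym;
            if (capitalize) {
                word[0] = static_cast<char>(
                    std::toupper(static_cast<unsigned char>(word[0])));
                capitalize = false;
            }
            bool punct = (word == "." || word == "!" || word == "?");
            if (!out.empty() && !punct) out += ' ';  // no space before punctuation
            out += word;
        }
        if (work.empty() && n > 0) work.push_back("S");  // start a new sentence
    }
    return out;
}
```

Running encode(77) through this reconstruction reproduces the walkthrough's output, "Spot grows sally."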
Example 2
Things get more efficient if I replace my terminals with:
77 yields "Jo papa ja." under this encoding (and the information is really carried by just the "Jo " and the fact that "papa" has 2 syllables; the extra letters would be a very small fraction of any book-length file).
Example 3
Your example "08F734F7" would be "1000111101110011010011110111" in binary, which is "1110111100101100111011110001" when reversed and that is, 250793713 in decimal.
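That arithmetic can be checked mechanically; the loop below reverses the significant bits of a value (the leading zeros are dropped implicitly, because the loop stops once v reaches 0):

```cpp
#include <cassert>
#include <cstdint>

// Reverse the significant bits of v: 0x08F734F7 has 28 significant bits,
// and reversing them should give 250793713 as stated above.
uint64_t reverseBits(uint64_t v) {
    uint64_t r = 0;
    while (v) {
        r = (r << 1) | (v & 1);  // move the lowest bit of v to the bottom of r
        v >>= 1;
    }
    return r;
}
```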
If I run that through the more compact grammar, I get:
This yields:
"Ja pysy vy? Vo pa ja." from 08F734F7
(note that my print routine removes spaces before punctuation)
This is an old question, but very interesting.
Once I wanted to do a similar conversion, but with other goals. GUIDs (UUIDs) are usually not eye-friendly, so I had to convert them into plausible words. The final system was based on the occurrence of an English letter after the two preceding ones. The table was built from a corpus of English sentences, and the combinations that occurred too rarely were excluded.
So the final table contained lines looking like
It contained about 200-300 lines, where 'next' is all the possible letters that can appear after the 'key' letters ('_' is the beginning or end of a word, depending on whether it appears in the key or in next).
The conversion process took the current value, divided it by length(next), used the remainder as the index of the next 'plausible' letter, and made the quotient the new current value. To avoid long words, there was a trick, applied symmetrically by encoding and decoding, to explicitly end a word. This system could produce, for example, sequences like these (the input for each is a 128-bit guid/uuid):
or if we take some widely used guids, for example MS IWebBrowser2 {D30C1661-CDAF-11D0-8A3E-00C04FC9E26E}
("Lakar Rupplex" is a good human name for a browser, isn't it?)
As for the density, this system gave about 3 bits per letter density.
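A self-contained sketch of the process, with one simplification: the real system used a 200-300 line corpus-derived table keyed on the two preceding letters, while here the candidate set is derived from just the last letter of the context (consonant/vowel alternation), which keeps the example short. The encode/decode pair shows the remainder-and-quotient loop and how it is reversed:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Which letters may follow the given context ('_' marks a word boundary).
// A real table would be built from an English corpus.
std::string candidates(char last) {
    const std::string vowels = "aeiou", consonants = "bklmprstv";
    if (last == '_') return consonants;                 // start of a word
    if (vowels.find(last) != std::string::npos)
        return consonants + "_";                        // consonant, or end the word
    return vowels;                                      // after a consonant: a vowel
}

std::string encode(uint64_t n) {
    std::string out;
    char last = '_';
    while (n > 0) {
        std::string next = candidates(last);
        char c = next[n % next.size()];  // remainder picks the letter
        n /= next.size();                // quotient becomes the new current value
        out += (c == '_' ? ' ' : c);
        last = c;
    }
    return out;
}

uint64_t decode(const std::string& text) {
    // Recompute the contexts forward, then fold the digits back in reverse.
    char last = '_';
    std::vector<std::pair<uint64_t, uint64_t>> digits;  // (base, index)
    for (char ch : text) {
        char c = (ch == ' ' ? '_' : ch);
        std::string next = candidates(last);
        digits.push_back({next.size(), next.find(c)});
        last = c;
    }
    uint64_t n = 0;
    for (auto it = digits.rbegin(); it != digits.rend(); ++it)
        n = n * it->first + it->second;
    return n;
}
```

With 5 to 10 candidates per position this yields roughly 2-3 bits per letter, consistent with the density quoted above.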
I personally would use c++. For a program that would do what you describe I would make something like this:
This should break up the src data into 4-bit sections, add each to 'a' and put it in the destination. You can then go through and add extra letters in between, but only as long as you have a sure way of reversing the process.
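The code block itself was lost from this answer; a minimal reconstruction of what it likely did (the function names are mine):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Split each byte into two 4-bit halves and add each half to 'a',
// giving letters in the range a..p.
std::string encode4(const std::vector<uint8_t>& src) {
    std::string dst;
    for (uint8_t b : src) {
        dst += static_cast<char>('a' + (b >> 4));    // high nibble
        dst += static_cast<char>('a' + (b & 0x0F));  // low nibble
    }
    return dst;
}

std::vector<uint8_t> decode4(const std::string& s) {
    std::vector<uint8_t> out;
    for (std::size_t i = 0; i + 1 < s.size(); i += 2)
        out.push_back(static_cast<uint8_t>(
            ((s[i] - 'a') << 4) | (s[i + 1] - 'a')));
    return out;
}
```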
To make it a little less obvious I would use more than 4 bits at a time, but not an even 8 either. Here is an example with using chunks of 6 bits:
This would make a jumble of 3 letter words for each chunk of 6 bits.
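That example was also lost in formatting; here is a hedged guess at its shape. Since 64 = 4 × 4 × 4, each 6-bit chunk can map to a three-letter consonant-vowel-consonant word, each letter drawn from a 4-symbol set (the particular sets below are arbitrary):

```cpp
#include <cassert>
#include <string>

// Map a 6-bit value (0..63) to a 3-letter word: 2 bits per letter.
std::string word6(unsigned v) {
    const char* onset   = "bdkt";  // bits 5-4
    const char* nucleus = "aeio";  // bits 3-2
    const char* coda    = "lmns";  // bits 1-0
    std::string w;
    w += onset[(v >> 4) & 3];
    w += nucleus[(v >> 2) & 3];
    w += coda[v & 3];
    return w;
}

// Inverse mapping: recover the 6-bit value from the word.
unsigned unword6(const std::string& w) {
    const std::string onset = "bdkt", nucleus = "aeio", coda = "lmns";
    return static_cast<unsigned>(
        (onset.find(w[0]) << 4) | (nucleus.find(w[1]) << 2) | coda.find(w[2]));
}
```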
You could do a simple substitution algorithm with a set conversion table that varies based on the power of the digit in the original number. Weight the values in the conversion tables so that vowels and certain consonants are more common. Pick some base large enough to have variety across the places. E.g. (hex based data):
(This could also be done with simple well chosen formulas for each column...)
So,
Extend this out to enough columns to control the distribution of all the characters well. If your source data does not have a cleanly random distribution you may want to mix up the order from column to column as well. Notice how some characters exist in every column, and some only exist once. Also the vowel to consonant frequency can be tweaked by changing the average ratio in every column.
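The conversion table itself did not survive the formatting; as a hedged illustration of the idea, here is one possible shape, with each column's 16-entry alphabet weighted toward vowels (the particular letters are my invention):

```cpp
#include <cassert>
#include <string>
#include <vector>

// One substitution alphabet per digit position (cycled for longer input).
// Each string has 16 entries, one per hex digit; vowels recur in every
// column so the output stays pronounceable-ish.
const std::vector<std::string> COLUMNS = {
    "aeioubcdfgaeiouh",  // column 0
    "lmnprstaeiouvwyz",  // column 1
    "aeiouaeioubcdfgh",  // column 2
    "jklmnpqrstaeiouv",  // column 3
};

std::string substitute(const std::string& hex) {
    auto val = [](char c) {
        return c <= '9' ? c - '0' : (c | 0x20) - 'a' + 10;
    };
    std::string out;
    for (std::size_t i = 0; i < hex.size(); ++i)
        out += COLUMNS[i % COLUMNS.size()][val(hex[i])];
    return out;
}
```

Because every column is a permutation-free lookup of 16 distinct slots, the mapping is reversed by finding each letter's index in its column's table.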
Take large fixed size chunks of the data and run them through the converter, then apply a spacing/punctuation/capitalization algorithm.
(There is no guarantee that you won't end up with an all consonant or extremely low vowel count word, but you can have the capitalization algorithm make it all caps to look like an acronym/initialism)
Read here http://email.about.com/cs/standards/a/base64_encoding.htm