Generating pseudo-natural phrases from big integers in a reversible way
I have a large and "unique" integer (actually a SHA1 hash).
Note: While I'm talking about SHA1 hashes here, this is not a cryptography / security question! I'm not trying to break SHA1. Imagine a random 160-bit integer instead of SHA1 if that helps.
I want (for no other reason than to have fun) to find an algorithm to map that SHA1 hash to a computer-generated (pseudo-)English phrase. The mapping should be bidirectional (i.e., knowing the algorithm, one must be able to calculate the original SHA1 hash from that phrase.)
The phrase need not make sense. I would even settle for a whole paragraph of nonsense. (Though the quality, that is the Englishness, of a paragraph should probably be better than that of a mere phrase.)
A better algorithm would produce shorter, more natural-looking, more unique phrases.
A variation: it is OK if I can work with only part of the hash. Say, the first six hex digits would be fine.
A possible use for the generated phrase: a human-readable version of a Git commit ID, used as a motto for the program version built from that commit. (As I said, this is "for fun". I don't claim that this is very practical, or much more readable than the SHA1 itself.)
Possible approach: In the past I attempted to build a probability table (of words) and generate phrases as Markov chains, seeding the generator (picking branches from the probability tree) according to the bits I read from the SHA. This was not very successful: the resulting phrases were too long and ugly. I'm not sure whether this was a bug or a general flaw in the algorithm, since I had to abandon it early.
Now I'm thinking about attempting to solve the problem once again. Any advice on how to approach this? Do you think the Markov chain approach can work here? Something else?
Comments (4)
A very simple approach would be:
Take lists of, say, 1024 nouns, 1024 verbs, and 1024 adjectives (10 bits per word). Your phrase could then be a sentence built on a fixed pattern of those word classes.
With a bit more linguistic thought you can probably construct slightly more complicated, and thus less repetitive-looking, sentences (say, a bit for singular / plural, a bit or two for different tenses, ...). Longer word lists use up a few more bits per word, but my guess is that you reach rather exotic words quite fast.
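To make the word-list idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three toy lists hold only 8 words each (3 bits per word) instead of 1024, the sentence template "The ADJECTIVE NOUN VERB." is made up, and the names `encode` and `decode` are hypothetical. The number is consumed as mixed-radix digits, so the lists do not need to be the same size.

```python
# Hypothetical toy word lists: 8 entries each (3 bits per word).
# The answer suggests 1024 entries each (10 bits per word).
ADJECTIVES = ["quick", "lazy", "green", "silent", "brave", "tiny", "odd", "warm"]
NOUNS = ["fox", "river", "engine", "poet", "cloud", "hammer", "owl", "garden"]
VERBS = ["jumps", "sings", "melts", "waits", "spins", "grows", "hides", "drifts"]
SLOTS = [ADJECTIVES, NOUNS, VERBS]  # order of the slots in one sentence


def encode(value: int, sentences: int) -> str:
    """Turn a non-negative integer into a fixed number of template sentences."""
    parts = []
    for _ in range(sentences):
        words = []
        for slot in SLOTS:
            # Peel off one mixed-radix digit and use it as a word index.
            value, index = divmod(value, len(slot))
            words.append(slot[index])
        parts.append("The {} {} {}.".format(*words))
    if value:
        raise ValueError("value does not fit in the requested number of sentences")
    return " ".join(parts)


def decode(phrase: str) -> int:
    """Recover the integer from a phrase produced by encode()."""
    # Drop punctuation and the fixed word "The"; only list words carry bits.
    tokens = [w for w in phrase.replace(".", "").split() if w != "The"]
    value = 0
    # Undo the divmod steps in reverse order, last word first.
    for position in reversed(range(len(tokens))):
        slot = SLOTS[position % len(SLOTS)]
        value = value * len(slot) + slot.index(tokens[position])
    return value


# First six hex digits of a commit ID = 24 bits; three sentences hold 27 bits.
prefix = int("3fa2c1", 16)
motto = encode(prefix, sentences=3)
assert decode(motto) == prefix
```

With real 1024-word lists, a single sentence of this shape would already carry 30 bits, which is more than the six hex digits mentioned in the question.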
Well, let's see... The English language has about 1,000,000 words. That's about 20 bits per word. SHA1 is 160 bits, so you'll need 8 words. Theoretically, all you have to do is take the n-th word of the Oxford English Dictionary, where n is one group of 20 bits at a time.
Now, to make it more natural, you can try to add "in/at/on/and/the..." between the words, according to their type (nouns, verbs, ...), using some simple algorithm. (You should remove all these words from your base dictionary, of course.)
The algorithm is reversible: just remove all the words you've added, and convert each remaining word to its 20-bit index.
Also, try googling "insult generator". Some of those generators are pretty nice. I'm not sure about the number of combinations, though.
You can buy the Oxford English Dictionary on CD-ROM with more than 500,000 words (19 bits). I'm not sure if it would be easy to extract the words and their types, however. I'm not sure if it is legal, but I think you can't claim a patent on dictionary entries...
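A rough Python sketch of this scheme follows. It makes several assumptions that are not part of the answer: since no million-word list is at hand, `index_to_word` fakes "the n-th dictionary word" by gluing four syllables together (32 syllables, 5 bits each, 20 bits per word), and the filler words are simply cycled instead of being chosen by word type. All function names are hypothetical.

```python
import hashlib

CONSONANTS = "bdfklmnt"  # 8 choices
VOWELS = "aeio"          # 4 choices
# 32 two-letter syllables, so one syllable carries 5 bits.
SYLLABLES = [c + v for c in CONSONANTS for v in VOWELS]
# Decoration only: these words carry no bits and are dropped when decoding.
FILLERS = ["the", "of", "and", "in"]


def index_to_word(index: int) -> str:
    """Stand-in for 'the n-th dictionary word': four syllables = 20 bits."""
    syllables = []
    for _ in range(4):
        index, s = divmod(index, 32)
        syllables.append(SYLLABLES[s])
    return "".join(syllables)


def word_to_index(word: str) -> int:
    """Inverse of index_to_word()."""
    index = 0
    for pos in reversed(range(4)):
        index = index * 32 + SYLLABLES.index(word[2 * pos:2 * pos + 2])
    return index


def encode(digest: bytes) -> str:
    """Map a 160-bit digest to 8 data words separated by filler words."""
    if len(digest) != 20:
        raise ValueError("expected a 20-byte (160-bit) digest")
    value = int.from_bytes(digest, "big")
    out = []
    for i in range(8):
        shift = 20 * (7 - i)  # most significant 20-bit group first
        out.append(index_to_word((value >> shift) & 0xFFFFF))
        if i < 7:
            out.append(FILLERS[i % len(FILLERS)])
    return " ".join(out)


def decode(phrase: str) -> bytes:
    """Strip the fillers and rebuild the digest from the 20-bit indices."""
    value = 0
    for token in phrase.split():
        if token in FILLERS:
            continue
        value = (value << 20) | word_to_index(token)
    return value.to_bytes(20, "big")


digest = hashlib.sha1(b"example commit").digest()
assert decode(encode(digest)) == digest
```

The fillers never collide with data words here because every data word is exactly eight letters long. With a real dictionary, the same effect comes from removing the filler words from the base list, as the answer says.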
This is an old question but entropoetry is a JavaScript (Node/frontend) library that also solves this problem. It combines Markov poetry with Huffman coding, so given the same dictionary (i.e., the same version of the library), converting poetry↔︎numbers will be bidirectional.
Example, from the Node command line:
And as technology marches on, what seemed like a “fun only” idea in 2011 has some real uses in 2017: memorizing cryptocurrency private keys (brain wallet), Dat/IPFS links, etc.
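For readers who want to see why a chain-driven encoding can be reversible at all, here is a small Python sketch. It is not entropoetry's API or algorithm: the `CHAIN` table is a made-up toy, and instead of Huffman codes each step simply picks among the allowed successor words with a mixed-radix digit of the number (so all successors are treated as equally likely). A sentinel 1 bit is placed above the payload so that leading zero bits survive the round trip.

```python
# Hypothetical toy chain: each word maps to the words allowed to follow it.
# Every list has at least two entries, so every step consumes part of the number.
CHAIN = {
    "<start>": ["the", "a", "every"],
    "the": ["cat", "moon", "river", "coder"],
    "a": ["cat", "river", "coder"],
    "every": ["cat", "river", "coder"],
    "cat": ["sleeps", "purrs", "waits"],
    "moon": ["rises", "waits"],
    "river": ["sings", "waits", "rises"],
    "coder": ["sleeps", "sings", "waits", "purrs"],
    "sleeps": ["and", "while"],
    "purrs": ["and", "while", "but"],
    "waits": ["and", "but"],
    "rises": ["and", "while"],
    "sings": ["while", "but"],
    "and": ["the", "a", "every"],
    "while": ["the", "a"],
    "but": ["the", "every"],
}


def encode(number: int, nbits: int) -> str:
    """Walk the chain, letting the number choose each successor word."""
    if not 0 <= number < (1 << nbits):
        raise ValueError("number does not fit in nbits")
    value = (1 << nbits) | number  # sentinel bit marks the payload length
    state, words = "<start>", []
    while value:
        choices = CHAIN[state]
        value, index = divmod(value, len(choices))
        state = choices[index]
        words.append(state)
    return " ".join(words)


def decode(phrase: str) -> int:
    """Replay the walk backwards to rebuild the number."""
    words = phrase.split()
    states = ["<start>"] + words[:-1]  # state each word was chosen from
    value = 0
    for state, word in zip(reversed(states), reversed(words)):
        choices = CHAIN[state]
        value = value * len(choices) + choices.index(word)
    nbits = value.bit_length() - 1
    return value ^ (1 << nbits)  # remove the sentinel bit


prefix = int("3fa2c1", 16)  # first six hex digits of a commit ID
poem = encode(prefix, nbits=24)
assert decode(poem) == prefix
```

With only two to four successors per word, each word carries one or two bits, so even 24 bits produce a long phrase that may stop mid-sentence. That matches the "too long and ugly" experience in the question. Larger successor lists, or variable-length codes weighted by word frequency as in Huffman coding, are the usual ways to trade phrase length against naturalness.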
A hash function means it is not possible (within reasonable limits) to recover the data from the hash, unless the function is broken (insecure).
The question should be about breaking the SHA-1 hash algorithm. Look at Google: it's not that broken. So no, you cannot create an English phrase from a SHA-1 hash code. If you can, please write a big paper about it; lots of papers are useless, but this one would be a breakthrough :-)
Edit: if only part of the hash is enough, I suggest just brute force (plus a simple hash<->phrase map, possibly in a file or database). Breaking a hash algorithm is very "strong soup" (a difficult problem).
Edit 2: guys, be more specific when asking questions; not my fault... I will not delete this, so that it scares off any other crypto guys around :-)