How do one-way hash functions work? (Edited)

Posted on 2024-08-19 10:31:19

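I read the Wikipedia article about md5 hashes, but I still can't understand how a hash can't be "reconstituted" back to the original text.

Could someone explain how this works to somebody who knows very little about cryptography? What part of the function makes it one-way?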

枕梦 2024-08-26 10:31:19

Since everyone until now has simply defined what a hash function was, I will bite.

A one-way function is not just a hash function -- a function that loses information -- but a function f for which, given an image y ("SE" or 294 in existing answers), it is difficult to find a pre-image x such that f(x)=y.

This is why they are called one-way: you can compute an image but you can't find a pre-image for a given image.

None of the ordinary hash functions proposed so far in the existing answers has this property. None of them is a one-way cryptographic hash function. For instance, given "SE", you can easily come up with the input "SXXXE", which has the property that X-encode("SXXXE")=SE.

There are no "simple" one-way functions. They have to mix their inputs so well that not only do you not recognize the input at all in the output, but you do not recognize any other input either.

SHA-1 and MD5 used to be popular one-way functions, but both are nearly broken: specialists know how to find collisions for them (or are nearly able to do so), although computing a pre-image for a given image is still out of practical reach. There is a contest underway to choose a new standard one, which will be named SHA-3.

An obvious approach to inverting a one-way function would be to compute many images and keep them in a table that associates each image with the pre-image that produced it. To make this impossible in practice, all one-way functions have a large output, at least 64 bits but possibly much larger (up to, say, 512 bits).
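
To make the table approach concrete, here is a minimal Python sketch. It assumes a hypothetical `tiny_hash` that keeps only 16 bits of SHA-256, purely so that the whole output space fits in memory; the function names and the input set are illustrative and not part of any standard.

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    """Hypothetical 16-bit hash: the first two bytes of SHA-256 (illustration only)."""
    return int.from_bytes(hashlib.sha256(data).digest()[:2], "big")

def build_table(num_inputs: int) -> dict:
    """Map each image to one pre-image that produced it."""
    table = {}
    for i in range(num_inputs):
        candidate = str(i).encode()
        # Keep the first pre-image found for each image.
        table.setdefault(tiny_hash(candidate), candidate)
    return table

table = build_table(1_000_000)
target = tiny_hash(b"some secret input")
preimage = table.get(target)

# With only 65,536 possible images, the table very likely covers the target.
if preimage is not None:
    print(preimage, tiny_hash(preimage) == target)
```

The pre-image found is usually not the original input, only some input with the same image. With a 128-bit or 256-bit output, the same table would need far more entries than could ever be stored, which is the point of the large output size.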

EDIT: How do most cryptographic hash functions work?

Usually they have at their core a single function that does complicated transformations on a block of bits (a block cipher). The function should be bijective. And this function is iterated a fixed number of times, enough to make the input (or any possible input) impossible to recognize.

Take the example of Skein, one of the strong candidates in the SHA-3 contest. Its core function is iterated 72 times. The highest number of iterations for which the creators of the function know how to sometimes relate the outputs to some inputs is 25. They say it has a "safety factor" of 2.9.

温暖的光 2024-08-26 10:31:19

Think of a really basic hash - for the input string, return the sum of the ASCII values of each character.

hash( 'abc' ) = ascii('a')+ascii('b')+ascii('c')
              = 97 + 98 + 99
              = 294

Now, given the hash value of 294, can you tell what the original string was? Obviously not, because 'abc' and 'cba' (and countless others) give the same hash value.
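
The ASCII-sum hash above is easy to try out. A minimal Python sketch (the name `ascii_sum_hash` is made up for this illustration):

```python
def ascii_sum_hash(text: str) -> int:
    """Sum of the character codes of the string, as in the example above."""
    return sum(ord(ch) for ch in text)

print(ascii_sum_hash("abc"))  # 97 + 98 + 99 = 294
print(ascii_sum_hash("cba"))  # same characters in another order, same sum
print(ascii_sum_hash("bbb"))  # 98 * 3 is also 294
```

All three strings map to 294, so the value alone cannot tell you which one was hashed.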

Cryptographic hash functions work the same way, except that obviously the algorithm is much more complex. There are always going to be collisions, but if you know string s hashes to h, then it should be very difficult ("computationally infeasible") to construct another string that also hashes to h.

凌乱心跳 2024-08-26 10:31:19

Shooting for a simple analogy here instead of a complex explanation.

To start with, let's break the subject down into two parts, one-way operations and hashing. What is a one-way operation and why would you want one?

One-way operations are called that because they are not reversible. Most typical operations, like addition and multiplication, can be reversed, while modulo division cannot. Why is that important? Because you want to provide an output value which 1) is difficult to duplicate without the original inputs and 2) provides no way to figure out the inputs from the output.

Reversible

Addition:

4 + 3 = 7

This can be reversed by taking the sum and subtracting one of the addends

7 - 3 = 4

Multiplication:

4 * 5 = 20

This can be reversed by taking the product and dividing by one of the factors

20 / 4 = 5

Not Reversible

Modulo division:

22 % 7 = 1

This cannot be reversed because there is no operation that you can apply to the remainder and the divisor to reconstitute the dividend (or to the remainder and the dividend to reconstitute the divisor).

Can you find an operation to fill in where the '?' is?

1  ?  7 = 22  
1  ?  22 = 7

With that being said, one-way hash functions have the same mathematical quality as modulo division.
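
A small Python sketch of why the remainder does not pin down the dividend. The search range of 0 to 99 is an arbitrary choice for the illustration:

```python
divisor = 7
remainder = 22 % divisor  # 1

# Every dividend of the form 7*k + 1 leaves the same remainder,
# so knowing the remainder and the divisor is not enough to recover 22.
candidates = [n for n in range(100) if n % divisor == remainder]
print(candidates)
```

Even within this small range there are fifteen candidates, and the list goes on forever as the range grows.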

Why is this important?

Let's say I gave you a key to a locker in a bus terminal that has one thousand lockers and asked you to deliver it to my banker. Being the smart guy you are, not to mention suspicious, you would immediately look at the key to see what locker number is written on it. Knowing this, I've done a few devious things: first, I found two numbers that, when combined using modulo division, give me a number in the range between 1 and 1000; second, I erased the original number and wrote on the key the divisor from that pair of numbers; third, I chose a bus terminal that has a guard protecting the lockers from miscreants by only letting people try one locker a day with their key; fourth, the banker already knows the dividend, so when he gets the key he can do the math, figure out the remainder, and know which locker to open.

If I choose the operands wisely, I can get close to a one-to-one relationship between the remainder and the dividend. That forces you to try each locker, because the results for the possible inputs are spread over the whole range of desired numbers, namely the lockers available in the terminal. Basically, it means you can't acquire any knowledge about the remainder even if you know one of the operands.

So, now I can 'trust' you to deliver the key to its rightful owner without worrying that you can easily guess to which locker it belongs. Sure, you could brute force search all the lockers but that would take almost 3 years, plenty of time for my banker to use the key and empty the locker.

See the other answers for more specifics on the different hash functions.

后来的我们 2024-08-26 10:31:19

Here's a very simple example. Assume that I'm a beginning cryptographer and I create a hash function that does the following:

int SimpleHash(java.io.File file) {
    // 0 if the file length is even, 1 if it is odd
    return (int) (file.length() % 2);
}

Now here's the test. SimpleHash(specialFile) is 0. What was my original file?

Obviously, there's no way to know (although you could likely discover pretty easily that my hash is based on file length). There is no way to "reconstitute" my file based on the hash because the hash doesn't contain everything that my file did.
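
The same idea as a runnable Python sketch, working on byte strings instead of files (the name `simple_hash` and the sample inputs are made up for the illustration):

```python
def simple_hash(data: bytes) -> int:
    """0 for even-length input, 1 for odd-length input."""
    return len(data) % 2

inputs = [b"", b"hi", b"four", b"x" * 1000]
print([simple_hash(d) for d in inputs])
```

Every even-length input, whatever its content, produces 0, so a result of 0 leaves essentially all of the original information unrecoverable.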

亂 2024-08-26 10:31:19

In simple terms, a hash function works by making a big tangled mess of the input data.

See MD5 for instance. It processes input data in 512-bit blocks. Each block is split into sixteen 32-bit words. There are 64 steps, each step using one of the 16 input words, so each word is used four times in the course of the algorithm. This is where one-wayness comes from: any input bit is input at several places, and between two such inputs the function mixes all the current data together so that each input bit impacts most of the 128-bit running state. This prevents you from inverting the function, or computing a collision, by looking at only a part of the data. You have to look at the whole 128 bits, and the space of 128-bit blocks is too wide to be efficiently walked through.
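
The effect of this mixing can be observed from the outside with Python's standard `hashlib` module. This is only a sketch: the two messages are arbitrary and `bit_difference` is a helper written for the illustration.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

msg1 = b"hello world"
msg2 = b"hello worle"  # differs from msg1 in a single bit ('d' -> 'e')

d1 = hashlib.md5(msg1).digest()
d2 = hashlib.md5(msg2).digest()

print(bit_difference(msg1, msg2))                  # input difference in bits
print(bit_difference(d1, d2), "of", len(d1) * 8)   # output difference in bits
```

For a well-mixed function, roughly half of the 128 output bits are expected to change when a single input bit is flipped.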

Now MD5 does not do a good job at it, since collisions for that function can be found. From a cryptographer's point of view, MD5 is a rotated encryption function. The processing of one message block M (512 bits) uses an input state V (a 128-bit value) and computes the new state V' as V' = V + E(M, V), where '+' is a word-wise addition and 'E' happens to be a symmetric encryption function (aka a 'block cipher') which uses M as key and V as the message to be encrypted. On closer inspection, E is a kind of "extended Feistel network", similar to the DES block cipher, with four quarters instead of two halves. Details are not important here; my point is that what makes a "good" hash function, among hash functions which use that structure (called "Merkle-Damgård"), is similar to what makes a block cipher "secure". The successful collision attacks on MD5 use differential cryptanalysis, a tool which was designed to attack block ciphers in the first place.
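
The shape V' = V + E(M, V) can be sketched in a few lines of Python. To be clear, this is a toy and NOT MD5: the round function, the rotation amount, the order in which key words are used, and the starting state are all invented for the illustration, and message padding is left out entirely. It only shows a four-branch Feistel-like cipher keyed by the message block, with the input state added back word by word.

```python
MASK = 0xFFFFFFFF  # keep every word within 32 bits

def rotl(x: int, r: int) -> int:
    """Rotate a 32-bit word left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK

def toy_encrypt(key_words, state):
    """Toy four-branch Feistel-like cipher E(M, V). Illustration only, NOT MD5."""
    a, b, c, d = state
    for step in range(64):
        k = key_words[step % 16]        # each of the 16 key words is used four times
        f = (b & c) | (~b & d & MASK)   # nonlinear mix of the other three branches
        # Invertible for a fixed key word: a can be recovered from the result, b, c and d.
        a = (b + rotl((a + f + k + step) & MASK, 7)) & MASK
        a, b, c, d = d, a, b, c         # rotate the branches
    return (a, b, c, d)

def compress(state, block: bytes):
    """One Merkle-Damgard step: V' = V + E(M, V), with word-wise addition."""
    assert len(block) == 64  # one 512-bit message block
    key_words = [int.from_bytes(block[4 * i:4 * i + 4], "little") for i in range(16)]
    encrypted = toy_encrypt(key_words, state)
    return tuple((v + e) & MASK for v, e in zip(state, encrypted))

state = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)  # arbitrary starting value
for block in (bytes(64), bytes([1]) * 64):
    state = compress(state, block)
print(" ".join(f"{word:08x}" for word in state))
```

E on its own is a permutation of the state for any fixed block, so it could be run backwards. Adding V back in is what stops an attacker from simply decrypting the output to step backwards through the chain.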

From a good block cipher to a good hash function, there is a step which is not to be dismissed. With the Merkle-Damgård structure, the hash function is secure if the underlying block cipher is resistant to "related key attacks", a rather obscure property against which block ciphers are rarely strengthened because, for symmetric encryption, related key attacks barely have any practical impact. For instance, the AES encryption turned out not to be as resistant to related key attacks as could be wished for, and this did not trigger general panic. That resistance was not among the properties which were sought when AES was designed. It just prevents turning the AES into a hash function. There is a hash function called Whirlpool, which builds on a derivative of Rijndael, "Rijndael" being the initial name of what became the AES; but Whirlpool takes care to modify the parts of Rijndael which are weak to related key attacks.

Also, there are other structures which can be used for building a hash function. The current standard functions (MD5, SHA-1, and the "SHA-2" family, aka SHA-224, SHA-256, SHA-384 and SHA-512) are Merkle-Damgård functions, but many of the would-be successors are not. There is an ongoing competition, organized by NIST (the US federal organization which deals with that kind of thing), to select a new standard hash function, dubbed "SHA-3". See this page for details. Right now, they are down to 14 candidates from an initial 51 (not counting a dozen extra which failed the administrative test of sending a complete submission with code which compiles and runs properly).

Let's now have a more conceptual look. A secure hash function should look like a random oracle: an oracle is a black box which, when given a message M as input, outputs an answer h(M) which is chosen at random, uniformly, in the output space (i.e. all n-bit strings if the hash function output length is n). If given the same message M again as input, the oracle outputs the same value as previously. Apart from that restriction, the output of the oracle on a not previously used input M is unpredictable. One can imagine the oracle as a container for a gnome who throws dice, and carefully records the input messages and corresponding outputs in a big book, so that he will honor his oracle contract. There is no way to predict what the next output will be since the gnome himself does not know that.
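
The gnome and his book translate almost directly into code. A minimal Python sketch, where the class name `RandomOracle` and its interface are invented for the illustration:

```python
import secrets

class RandomOracle:
    """The gnome with dice and a big book: a fresh random output for each new input."""

    def __init__(self, output_bits: int):
        self.output_bits = output_bits
        self.book = {}  # records every answer already given

    def query(self, message: bytes) -> int:
        if message not in self.book:
            # Throw the dice: a uniform value among all n-bit strings.
            self.book[message] = secrets.randbits(self.output_bits)
        return self.book[message]

oracle = RandomOracle(output_bits=128)
first = oracle.query(b"hello")
print(first == oracle.query(b"hello"))  # True: same input, same recorded output
```

Such an object cannot serve as a real hash function, because the book would have to be shared by everyone and would grow without bound. It is the ideal against which real functions are measured.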

If a random oracle exists, then inverting the hash function has cost 2^n: in order to obtain a given output, there is no better strategy than trying distinct input messages until one yields the expected value. Due to the uniform random selection, the probability of success is 1/(2^n) at each try, and the average number of requests to the dice-throwing gnome will be 2^n. For collisions (finding two distinct inputs which yield the same hash value), the cost is about 1.4*2^(n/2) (roughly speaking, with 1.4*2^(n/2) outputs, we can assemble about 2^n pairs of outputs, each having a probability of 1/(2^n) of matching, i.e. having two distinct inputs which have the same output). These are the best that can be done with a random oracle.
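
The collision cost of about 2^(n/2) tries can be checked experimentally on a small output size. The Python sketch below assumes a hypothetical `truncated_hash` that keeps only the top n bits of SHA-256; the choice of n = 24 and the way inputs are generated are arbitrary.

```python
import hashlib
from itertools import count

def truncated_hash(data: bytes, n_bits: int) -> int:
    """Hypothetical n-bit hash: the top n bits of SHA-256 (illustration only)."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return digest >> (256 - n_bits)

def find_collision(n_bits: int):
    """Hash distinct inputs until two share an output; return them and the try count."""
    seen = {}
    for tries in count(1):
        message = str(tries).encode()
        h = truncated_hash(message, n_bits)
        if h in seen:
            return seen[h], message, tries
        seen[h] = message

m1, m2, tries = find_collision(n_bits=24)
print(m1, m2, tries)  # tries is expected to be on the order of 2**12, not 2**24
```

Finding a pre-image of one specific 24-bit value by the same kind of search would take around 2^24 tries on average, which is the gap between the two costs described above.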

Therefore, we look for hash functions which are as good as a random oracle: they must mix the input data in such a way that we cannot find a collision more efficiently than what it would cost to simply invoke the function 2^(n/2) times. The bane of hash functions is mathematical structure, i.e. shortcuts which allow the attacker to view the hash function's internal state (which is big, at least n bits) as a variation on a mathematical object which lives in a much shorter space. Thirty years of public research on symmetric encryption systems have produced a whole paraphernalia of notions and tools (diffusion, avalanche, differentials, linearity...) which can be applied. The bottom line, however, is that we have no proof that a random oracle may actually exist. We want a hash function which cannot be attacked. What we have are hash function candidates for which no attack is currently known and, somewhat better, some functions for which some kinds of attack can be proven not to work.

There is still some research to be done.

苦笑流年记忆 2024-08-26 10:31:19

A hash is a (very) lossy encoding.

To give you a simpler example, imagine a fictitious 2-letter encoding of a 5-letter word called the X-encoding. The algorithm for the X-encoding is simple: take the first and last letters of the word.

So,

X-encode( SAUCE ) = SE
X-encode( BLOCK ) = BK

Clearly, you cannot reconstruct SAUCE from its encoding SE (assuming our range of possible inputs is all 5-letter words). The word could just as easily be SPACE.

As an aside, the fact that SAUCE and SPACE both produce SE as an encoding is called a collision, and you can see that the X-encoding wouldn't make a very good hash. :)
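
The X-encoding is short enough to write out. A minimal Python sketch (the function name `x_encode` is chosen to match the notation above):

```python
def x_encode(word: str) -> str:
    """First and last letters of the word, as in the X-encoding above."""
    return word[0] + word[-1]

for word in ("SAUCE", "BLOCK", "SPACE"):
    print(word, "->", x_encode(word))
```

SAUCE and SPACE both come out as SE, which is the collision mentioned above.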

晨曦慕雪 2024-08-26 10:31:19

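Arrays
Looked at closely, associative arrays looked very much like hashes. The main differences were the lack of the % sigil on hash names, and that one could only assign to them one key at a time. Thus, one would say $foo{'key'} = 1;, but only @keys = keys(foo);. Familiar functions like each, keys, and values worked as they do now (and delete was added in Perl 2).

Perl 3 had three whole data types: it had the % sigil on hash names, allowed an entire hash to be assigned at once, and added dbmopen (now deprecated in favour of tie). Perl 4 used comma-separated hash keys to emulate multidimensional arrays (now better handled with array references).

Perl 5 took the giant leap of referring to associative arrays as hashes. (As far as I know, it is the first language to have referred to the data structure this way, rather than as a "hash table" or something similar.) Somewhat ironically, it also moved the relevant code from hash.c to hv.c.

Nomenclature
As explained earlier, dictionaries are unordered collections of values indexed by unique keys. They are sometimes called associative arrays or maps. They can be implemented in several ways, one of which uses a data structure known as a hash table (and this is what Perl refers to as a hash).

Perl's use of the term "hash" is a source of some potential confusion, because the output of a hash function is also sometimes called a hash (especially in cryptographic contexts), and because hash tables are not usually called hashes anywhere else.

To be on the safe side, refer to the data structure as a hash table, and use the term "hash" only in obvious, Perl-specific contexts.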