没有桶的完美哈希可能吗?
我被要求寻找一个完美的哈希/单向函数,以便能够对 10^11 个数字进行哈希处理。 然而,由于我们将使用嵌入式设备,它不会有内存来存储相关的存储桶,所以我想知道是否有可能在没有它们的情况下获得一个像样的(最小的)完美哈希?
计划是使用设备对数字进行散列,我们使用彩虹表或使用散列作为偏移量的文件。
干杯
编辑:
我会尝试提供更多信息:)
1) 10^11 实际上现在是 10^10,这样就更容易了。这个数字是可能的组合。所以我们可以获得 0000000001 到 10000000000 (10^10) 之间的数字。
2)计划将其作为单向功能的一部分,以确保号码安全,以便我们可以通过不安全的方式发送它。 然后我们将使用彩虹表查找另一端的原始号码。 问题是源设备一般有512k-4Meg的内存可供使用。
3)它必须是完美的——我们100%不能发生碰撞。
编辑2:
4)我们不能使用加密,因为我们被告知这在设备上实际上是不可能的,如果我们可以的话,密钥管理将是一场噩梦。
Edit3:
因为这是不明智的,现在纯粹是学术问题(我保证)
I've been asked to look for a perfect hash/one way function to be able to hash 10^11 numbers.
However as we'll be using a embedded device it wont have the memory to store the relevant buckets so I was wondering if it's possible to have a decent (minimal) perfect hash without them?
The plan is to use the device to hash the number(s) and we use a rainbow table or a file using the hash as the offset.
Cheers
Edit:
I'll try to provide some more info :)
1) 10^11 is actually now 10^10 so that makes it easer.This number is the possible combinations. So we could get a number between 0000000001 and 10000000000 (10^10).
2) The plan is to us it as part of a one way function to make the number secure so we can send it by insecure means.
We will then look up the original number at the other end using a rainbow table.
The problem is that the source the devices generally have 512k-4Meg of memory to use.
3) it must be perfect - we 100% cannot have a collision .
Edit2:
4) We cant use encryption as we've been told it's not really possable on the devices and keymanigment would be a nightmare if we could.
Edit3:
As this is not sensible, Its purely academic question now (I promise)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
好吧,既然你已经澄清了你想要做什么,我重写了我的答案。
总结一下:使用真正的加密算法。
首先,让我回顾一下为什么你的哈希系统是一个坏主意。
无论如何,你的哈希系统是什么?
据我了解,您提出的系统是这样的:
您的嵌入式系统(我将其称为 C)正在发送某种值空间为 10^11 的数据。这些数据在传输到某种服务器(我将其称为 S)的过程中需要保密。
您的建议是将值
hash(salt + data)
发送到S。然后S将使用彩虹表来反转该哈希并恢复数据。salt
是 C 和 S 都知道的共享值。这是一种加密算法
加密算法,归根结底就是任何为您提供机密性的算法。由于您的目标是保密,因此任何满足您目标的算法都是加密算法,包括这个算法。
这是一种非常糟糕的加密算法。
首先,存在不可避免的冲突机会。此外,冲突值的集合每天都不同。
其次,即使对于合法服务器 S,解密也非常消耗 CPU 和内存。更改盐的成本甚至更高。
第三,虽然您既定的目标是避免密钥管理,但您的盐是关键!你根本没有解决密钥管理问题;任何有盐的人都能够像你一样破解该消息。
第四,它只能从C到S使用。您的嵌入式系统C将没有足够的计算资源来反转哈希,并且只能发送数据。
这并不比嵌入式设备上的真实加密算法快
。大多数安全散列算法的计算成本与合理的分组密码一样昂贵,甚至更差。例如,SHA-1 要求对每个 512 位块执行以下操作:
每 512 位消息有一个块,并且末尾可能有一个额外块。每个块有 1136 个二进制操作(不包括内存操作),或者每个字节大约有 16 个操作。
为了进行比较,RC4 加密算法每个字节需要四次操作(三次加法,加上消息上的异或),再加上两次数组读取和两次数组写入。它还只需要 258 字节的工作内存,而 SHA-1 的峰值需要 368 字节。
密钥管理是基础
对于任何保密系统,都必须拥有某种秘密。如果你没有秘密,那么其他任何人都可以实现相同的解码算法,你的数据就会暴露给全世界。
因此,对于将秘密放在哪里,您有两种选择。一种选择是使加密/解密算法保密。然而,如果算法的代码(或二进制文件)被泄露,你就输了——替换这样的算法是相当困难的。
因此,秘密通常很容易被替换——这就是我们所说的密钥。
您建议使用的哈希算法将需要一个盐 - 这是系统中唯一的秘密,因此是一个密钥。不管你喜欢与否,你都必须小心地管理这个密钥。如果泄露,替换它比其他密钥要困难得多 - 每次更改时,您都必须花费许多 CPU 小时来生成新的彩虹表!
你应该做什么?
使用真正的加密算法,并花一些时间实际考虑密钥管理。这些问题之前已经得到解决。
首先,使用真正的加密算法。 AES 专为高性能和低 RAM 要求而设计。您还可以使用像我之前提到的 RC4 这样的流密码 - 然而,使用 RC4 需要注意的是,您必须丢弃密码输出的前 4 KB 左右,否则您将很容易受到相同的影响困扰 WEP 的攻击。
其次,考虑密钥管理。一种选择是简单地将密钥刻录到每个客户端中,如果客户端受到威胁,则亲自出去更换它。如果您可以轻松地物理访问所有客户端,这是合理的。
否则,如果您不关心中间人攻击,则可以简单地使用 Diffie-Hellman 密钥交换 用于协商 S 和 C 之间的共享密钥。如果您担心 MitM,那么您需要开始查看 ECDSA 或用于验证从 DH 交换获得的密钥的东西 - 但请注意,当您开始沿着这条路走下去时,很容易出错。我建议此时实施 TLS。它并没有超出嵌入式系统的能力 - 事实上,有一个 数字 嵌入式 商业(并开放源) 库 已经可用。如果您不实施 TLS,那么在实施之前至少请专业密码学家检查您的算法。
Okay, since you've clarified what you're trying to do, I rewrote my answer.
To summarize: Use a real encryption algorithm.
First, let me go over why your hashing system is a bad idea.
What is your hashing system, anyway?
As I understand it, your proposed system is something like this:
Your embedded system (which I will call C) is sending some sort of data with a value space of 10^11. This data needs to be kept confidential in transit to some sort of server (which I will call S).
Your proposal is to send the value
hash(salt + data)
to S. S will then use a rainbow table to reverse this hash and recover the data.salt
is a shared value known to both C and S.This is an encryption algorithm
An encryption algorithm, when you boil it down, is any algorithm that gives you confidentiality. Since your goal is confidentiality, any algorithm that satisfies your goals is an encryption algorithm, including this one.
This is a very poor encryption algorithm
First, there is an unavoidable chance of collision. Moreover, the set of colliding values differs each day.
Second, decryption is extremely CPU- and memory-intensive even for the legitimate server S. Changing the salt is even more expensive.
Third, although your stated goal is avoiding key management, your salt is a key! You haven't solved key management at all; anyone with the salt will be able to crack the message just as well as you can.
Fourth, it's only usable from C to S. Your embedded system C will not have enough computational resources to reverse hashes, and can only send data.
This isn't any faster than a real encryption algorithm on the embedded device
Most secure hashing algorithm are just as computationally expensive as a reasonable block cipher, if not worse. For example, SHA-1 requires doing the following for each 512-bit block:
There is one chunk per 512-bits of the message, plus a possible extra chunk at the end. This is 1136 binary operations per chunk (not counting memory operations), or about 16 operations per byte.
For comparison, the RC4 encryption algorithm requires four operations (three additions, plus an xor on the message) per byte, plus two array reads and two array writes. It also requires only 258 bytes of working memory, vs a peak of 368 bytes for SHA-1.
Key management is fundamental
With any confidentiality system, you must have some sort of secret. If you have no secrets, then anyone else can implement the same decoding algorithm, and your data is exposed to the world.
So, you have two choices as to where to put the secrecy. One option is to make the encipherpent/decipherment algorithms secret. However, if the code (or binaries) for the algorithm is ever leaked, you lose - it's quite hard to replace such an algorithm.
Thus, secrets are generally made easy to replace - this is what we call a key.
Your proposed usage of hash algorithms would require a salt - this is the only secret in the system and is therefore a key. Whether you like it or not, you will have to manage this key carefully. And it's a lot harder to replace if leaked than other keys - you have to spend many CPU-hours generating a new rainbow table every time it's changed!
What should you do?
Use a real encryption algorithm, and spend some time actually thinking about key management. These issues have been solved before.
First, use a real encryption algorithm. AES has been designed for high performance and low RAM requirements. You could also use a stream cipher like RC4 as I mentioned before - the thing to watch out for with RC4, however, is that you must discard the first 4 kilobytes or so of output from the cipher, or you will be vulnerable to the same attacks that plauged WEP.
Second, think about key management. One option is to simply burn a key into each client, and physically go out and replace it if the client is compromised. This is reasonable if you have easy physical access to all of the clients.
Otherwise, if you don't care about man-in-the-middle attacks, you can simply use Diffie-Hellman key exchange to negotiate a shared key between S and C. If you are concerned about MitMs, then you'll need to start looking at ECDSA or something to authenticate the key obtained from the D-H exchange - beware that when you start going down that road, it's easy to get things wrong, however. I would recommend implementing TLS at that point. It's not beyond the capabilities of an embedded system - indeed, there are a number of embedded commercial (and open source) libraries available already. If you don't implement TLS, then at least have a professional cryptographer look over your algorithm before implementing it.
显然,不存在“完美”哈希这样的东西,除非您至少有与输入一样多的哈希桶;如果不这样做,那么您的两个输入不可避免地可能会共享同一个哈希桶。
但是,您不太可能存储 0 到 10^11 之间的所有数字。那么模式是什么呢?如果存在模式,则可能存在适合您的实际数据集的完美哈希函数。
不过,无论如何,找到一个“完美”的哈希函数实际上并不那么重要。哈希表非常快。冲突率非常低的函数 - 当散列整数时,这意味着几乎任何简单的函数,例如模数 - 都很好,您将获得 O(1) 平均性能。
There is obviously no such thing as a "perfect" hash unless you have at least as many hash buckets as inputs; if you don't, then inevitably it will be possible for two of your inputs to share the same hash bucket.
However, it's unlikely you'll be storing all the numbers between 0 and 10^11. So what's the pattern? If there's a pattern, there may be a perfect hash function for your actual data set.
It's really not that important to find a "perfect" hash function anyway, though. Hash tables are very fast. A function with a very low collision rate - and when hashing integers, that means nearly any simple function, like modulus - is fine and you'll get O(1) average performance.