Calculating a "superset"-based data checksum (SHA-1/2, etc.)
I'm not sure exactly how to ask this, but here's what I'm hoping for: given a structure that could contain 5+n keys (that is, 5 keys are mandatory in my system, and additional keys are optional), I would like a hashing mechanism that can determine that a 6-key hash with 5 identical keys is a superset of the 5-key struct's hash, and carries additional information. Specifically a hashing mechanism, as there are constraints which preclude sending the complete struct over the wire on every request.
For clarification, here's some information (the sample requires 2+n keys):
---
name: codebeaker
occupation: developer
Hashed with SHA-512 and SHA-256, this comes out looking like:
SHA-512
04fe500f2b3e779aba9ecb171224a04d35cc8453eb1521c7e31fd48b56b1cce9
b1e8af775e177e110982bfb16a6ca8652d7d9812ab8a8c316015dc9d6b3b54f7
SHA-256
4833be7086726e7ffd82db206f94f0a4f9fdf7fba00692f626157afed4587c74
When adding an additional key (example below), I would like to be able to deduce that the extended dataset is a superset of the first.
---
name: codebeaker
occupation: developer
telephone: 49 (0) 123 45 67
However, unsurprisingly, in MD5, SHA-n, and any other hashing function I have looked into, there is no way to do this. For example:
SHA-512
2fe2c1f01e39506010ea104581b737f95db6b6f71b1497788afc80a4abe26ab0
fc4913054278af69a89c152406579b7b00c3d4eb881982393a1ace83aeb7b6a2
SHA-256
77c2942e9095e55e13c548e5ef1f874396bfb64f7653e4794d6d91d0d3a168e2
(Obviously) there are no similarities...
Our use case: this data, formatted as a struct, is fed into our system by a third party. Processing the data is hugely expensive, 2-3 seconds per operation, and we can get about 50% of that time back if we know we already have a result from a previous run. However, Bayesian and Levenshtein text-difference algorithms aren't suitable here, as we often see key/value pairs that are acronyms, and other text that can appear similar while being completely unrelated.
What we need is a way to checksum the data (I might be biasing the answers here) so that we can determine that B is a superset of A if it contains all the same keys with the same data. However, there is often so much data in the key/value entries of our struct that sending it over the wire every time, only to determine that we already saw a more complete copy, would be expensive and wasteful.
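The digests shown above can be reproduced with a short Python sketch (the serialization format below, joining `key: value` lines, is my own assumption, not the original poster's). It illustrates why whole-struct hashing cannot work here: adding one key flips roughly half the output bits (the avalanche effect), so the two digests share no usable similarity.

```python
import hashlib

def struct_digest(struct: dict) -> str:
    """Hash the whole struct as one serialized blob (the naive approach)."""
    blob = "\n".join(f"{k}: {v}" for k, v in struct.items()).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

a = {"name": "codebeaker", "occupation": "developer"}
b = {**a, "telephone": "49 (0) 123 45 67"}

# The two digests are completely unrelated, even though b only adds one key.
print(struct_digest(a))
print(struct_digest(b))
```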
An idea would be to use different hashes per key-value pair. The "hash" of the complete struct is therefore a collection of hashes.
If your use case is always five identical keys in the same order, followed by any additional keys, you could use one hash for the mandatory keys and one for the optional keys - but you would then be unable to detect that one struct containing optional keys is a superset of another struct containing optional keys.
A slight variation is to use one hash for the required keys and one for the entire struct.
You could also (depending on your requirements) use smaller checksums for the key-value pairs to be able to quickly discard something as not being the same - but larger hashes would still be needed to more accurately determine that something is a match.
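The per-pair idea above can be sketched in a few lines of Python (the `key: value` serialization of each pair is an assumption for illustration). The "hash" of a struct is a set of per-pair digests, and superset detection becomes plain set inclusion:

```python
import hashlib

def pair_hashes(struct: dict) -> set:
    """One SHA-256 per key/value pair; the struct's 'hash' is the set of them."""
    return {
        hashlib.sha256(f"{k}: {v}".encode("utf-8")).hexdigest()
        for k, v in struct.items()
    }

def is_superset(b: dict, a: dict) -> bool:
    """B is a superset of A iff every pair hash of A also appears in B."""
    return pair_hashes(b) >= pair_hashes(a)

a = {"name": "codebeaker", "occupation": "developer"}
b = {**a, "telephone": "49 (0) 123 45 67"}

print(is_superset(b, a))  # True: b contains every pair of a
print(is_superset(a, b))  # False: a is missing the telephone pair
```

Only the set of fixed-size digests needs to travel over the wire, not the (potentially large) values themselves.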
Cryptographic hashes are specifically designed with these properties: the same input always produces the same hash; even the smallest change to the input produces a completely different, unpredictable hash; and it is computationally infeasible to find two inputs with the same hash.
Thus a cryptographic hash can be, and actually is, used as a unique identifier for any binary data. Even "name: codebeaker" has a different hash than "name: Codebeaker".
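That last point is easy to verify directly; a one-character case change yields an entirely unrelated digest:

```python
import hashlib

# Identical except for the capital "C" - the digests share nothing.
h1 = hashlib.sha256(b"name: codebeaker").hexdigest()
h2 = hashlib.sha256(b"name: Codebeaker").hexdigest()
print(h1 == h2)  # False
```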
If your set of keys is fixed, in a fixed order, always complete, only ever extended by new keys, and each key has only one allowed representation, then you can calculate the hash of the five old keys and compare it to the existing hashes of the current sets.
If the keys are always unique, but the sets can be mixed, then you can calculate a separate hash for each key and store and search these for the existing sets in a separate database.
Beyond this, cryptographic hashes may not be the right tool for the job.
[Edit]
Another approach is to first sort the keys alphabetically and then take the hash of the sorted set. This identifies your set without needing to care about order. It may be more practical to first take the individual hashes of the single keys, sort those hashes instead, and then take a hash over the list of sorted hashes. This still requires unique keys.
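The hash-of-sorted-hashes idea can be sketched as follows (hashing each pair as `key: value` is, again, an illustrative assumption). Two structs with the same pairs in a different order produce the same identifier:

```python
import hashlib

def canonical_hash(struct: dict) -> str:
    """Hash each pair, sort the digests, then hash the sorted list.
    The result identifies the set of pairs regardless of their order."""
    pair_digests = sorted(
        hashlib.sha256(f"{k}: {v}".encode("utf-8")).hexdigest()
        for k, v in struct.items()
    )
    return hashlib.sha256("".join(pair_digests).encode("ascii")).hexdigest()

a = {"name": "codebeaker", "occupation": "developer"}
a_reordered = {"occupation": "developer", "name": "codebeaker"}

print(canonical_hash(a) == canonical_hash(a_reordered))  # True
```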