Calculating a "superset"-based data checksum (SHA-1/2, etc.)
I'm not sure exactly how to ask this, but here's what I'm hoping for: given a structure that could contain 5+n keys (that is, 5 keys are mandatory in my system, and additional keys are optional), I would like a hashing mechanism that can determine that a 6-key hash with 5 identical keys is a superset of the 5-key struct's hash, and carries additional information. Specifically a hashing mechanism, as there are constraints which preclude sending the complete struct over the wire on every request.
For clarification, here's some information (the sample requires 2+n keys):
---
name: codebeaker
occupation: developer
Hashed with SHA-512 and SHA-256, this comes out looking like:
SHA-512
04fe500f2b3e779aba9ecb171224a04d35cc8453eb1521c7e31fd48b56b1cce9
b1e8af775e177e110982bfb16a6ca8652d7d9812ab8a8c316015dc9d6b3b54f7
SHA-256
4833be7086726e7ffd82db206f94f0a4f9fdf7fba00692f626157afed4587c74
When adding an additional key (example below), I would like to be able to deduce that the extended dataset is a superset of the first.
---
name: codebeaker
occupation: developer
telephone: 49 (0) 123 45 67
However, unsurprisingly, in MD5, SHA-n, and any other hashing function I have looked into, there is no way to do this. For example:
SHA-512
2fe2c1f01e39506010ea104581b737f95db6b6f71b1497788afc80a4abe26ab0
fc4913054278af69a89c152406579b7b00c3d4eb881982393a1ace83aeb7b6a2
SHA-256
77c2942e9095e55e13c548e5ef1f874396bfb64f7653e4794d6d91d0d3a168e2
(Obviously) there are no similarities...
Our use case: this data, formatted as a struct, is fed into our system by a third party. Processing the data is hugely expensive, 2-3 seconds per operation, and we can get about 50% of that time back if we know we already have a result from a previous run. However, Bayesian and Levenshtein text-difference algorithms aren't suitable here, as we often see key/value pairs that are acronyms, and other text that can appear similar while being completely unrelated.
What we need is a way to checksum the data (I might be biasing the answers here) so that we can determine that B is a superset of A if it contains all the same keys with the same data. However, there is often so much data in the key/value entries of our struct that sending it over the wire every time, only to determine that we already saw a more complete copy, would be expensive and wasteful.
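The digests shown above can be reproduced with a short Python sketch (the serialization format below, joining `key: value` lines, is my own assumption, not the original poster's). It illustrates why whole-struct hashing cannot work here: adding one key flips roughly half the output bits (the avalanche effect), so the two digests share no usable similarity.

```python
import hashlib

def struct_digest(struct: dict) -> str:
    """Hash the whole struct as one serialized blob (the naive approach)."""
    blob = "\n".join(f"{k}: {v}" for k, v in struct.items()).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()

a = {"name": "codebeaker", "occupation": "developer"}
b = {**a, "telephone": "49 (0) 123 45 67"}

# The two digests are completely unrelated, even though b only adds one key.
print(struct_digest(a))
print(struct_digest(b))
```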
An idea would be to use different hashes per key-value pair. The "hash" of the complete struct is therefore a collection of hashes.
If your use case is always five identical keys in the same order, followed by any additional keys, you could use one hash for the mandatory keys and one for the optional keys - but you would then be unable to detect that one struct containing optional keys is a superset of another struct containing optional keys.
A slight variation is to use one hash for the required keys and one for the entire struct.
You could also (depending on your requirements) use smaller checksums for the key-value pairs to be able to quickly discard something as not being the same - but larger hashes would still be needed to more accurately determine that something is a match.
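The per-pair idea above can be sketched in a few lines of Python (the `key: value` serialization of each pair is an assumption for illustration). The "hash" of a struct is a set of per-pair digests, and superset detection becomes plain set inclusion:

```python
import hashlib

def pair_hashes(struct: dict) -> set:
    """One SHA-256 per key/value pair; the struct's 'hash' is the set of them."""
    return {
        hashlib.sha256(f"{k}: {v}".encode("utf-8")).hexdigest()
        for k, v in struct.items()
    }

def is_superset(b: dict, a: dict) -> bool:
    """B is a superset of A iff every pair hash of A also appears in B."""
    return pair_hashes(b) >= pair_hashes(a)

a = {"name": "codebeaker", "occupation": "developer"}
b = {**a, "telephone": "49 (0) 123 45 67"}

print(is_superset(b, a))  # True: b contains every pair of a
print(is_superset(a, b))  # False: a is missing the telephone pair
```

Only the set of fixed-size digests needs to travel over the wire, not the (potentially large) values themselves.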
Cryptographic hashes are specifically designed with these properties: the same input always produces the same hash; even the smallest change to the input produces a completely different, unpredictable hash; and it is computationally infeasible to find two inputs with the same hash.
Thus a cryptographic hash can be, and actually is, used as a unique identifier for any binary data. Even "name: codebeaker" has a different hash than "name: Codebeaker".
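That last point is easy to verify directly; a one-character case change yields an entirely unrelated digest:

```python
import hashlib

# Identical except for the capital "C" - the digests share nothing.
h1 = hashlib.sha256(b"name: codebeaker").hexdigest()
h2 = hashlib.sha256(b"name: Codebeaker").hexdigest()
print(h1 == h2)  # False
```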
If your set of keys is fixed, in a fixed order, always complete, only ever extended by new keys, and each key has only one allowed representation, then you can calculate the hash of the five old keys and compare it to the existing hashes of the current sets.
If the keys are always unique, but the sets can be mixed, then you can calculate a separate hash for each key and store and search these for the existing sets in a separate database.
Beyond this, cryptographic hashes may not be the right tool for the job.
[Edit]
Another approach is to first sort the keys alphabetically and then take the hash of the sorted set. This identifies your set without needing to care about order. It may be more practical to first take the individual hashes of the single keys, sort those hashes instead, and then take a hash over the list of sorted hashes. This still requires unique keys.
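The hash-of-sorted-hashes idea can be sketched as follows (hashing each pair as `key: value` is, again, an illustrative assumption). Two structs with the same pairs in a different order produce the same identifier:

```python
import hashlib

def canonical_hash(struct: dict) -> str:
    """Hash each pair, sort the digests, then hash the sorted list.
    The result identifies the set of pairs regardless of their order."""
    pair_digests = sorted(
        hashlib.sha256(f"{k}: {v}".encode("utf-8")).hexdigest()
        for k, v in struct.items()
    )
    return hashlib.sha256("".join(pair_digests).encode("ascii")).hexdigest()

a = {"name": "codebeaker", "occupation": "developer"}
a_reordered = {"occupation": "developer", "name": "codebeaker"}

print(canonical_hash(a) == canonical_hash(a_reordered))  # True
```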