Hash algorithm for dynamically growing/streaming data?

Posted on 2024-11-05 01:30:38

Are there any algorithms that let you continue hashing from a known hash digest? For example, a client uploads one chunk of a file to ServerA, and I can compute the MD5 sum of the uploaded content. The client then uploads the rest of the file to ServerB. Can I transfer the internal MD5 state to ServerB and finish the hashing there?

Years ago I found a cool black-magic hack based on md5 on comp.lang.python, but it uses ctypes against a specific version of md5.so or _md5.dll, so it is not portable across Python interpreter versions or to other programming languages. Besides, the md5 module has been deprecated in the Python standard library since 2.5, so I need a more general solution.

What's more, can the hashing state be stored in the hex digest itself? (That way I could continue hashing a stream of data from an existing hash digest, without a dirty hack on the internals.)

Comments (4)

冷…雨湿花 2024-11-12 01:30:38

Not from the known digest, but from the known state. You can use a pure-Python MD5 implementation and save its state. Here is an example using _md5.py from PyPy:

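import _md5  # pure-Python MD5 implementation from PyPy (_md5.py)

def md5_getstate(md):
    # Copy the lists (x + []) so later updates do not change the saved state.
    return (md.A, md.B, md.C, md.D, md.count + [], md.input + [], md.length)

def md5_continue(state):
    # Create a fresh hasher and overwrite its internals with the saved state.
    md = _md5.new()
    (md.A, md.B, md.C, md.D, md.count, md.input, md.length) = state
    return md

m1 = _md5.new()
m1.update("hello, ")
state = md5_getstate(m1)
m2 = md5_continue(state)
m2.update("world!")
print(m2.hexdigest())

# Reference: hash the whole message in one go; both digests should match.
m = _md5.new()
m.update("hello, world!")
print(m.hexdigest())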
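
For readers on Python 3, where PyPy's _md5.py is not at hand, the same idea can be sketched with a small self-written MD5 based on the RFC 1321 algorithm. This is only an illustration, not a library API: the names ResumableMD5, getstate and _compress are made up for this example. It shows that the resumable state is the four 32-bit registers plus the not-yet-hashed tail bytes and the total length, which is why a finished hex digest alone is not enough to continue from.

```python
import hashlib
import json
import math
import struct

# Per-step rotation amounts and sine-derived constants (RFC 1321).
_S = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
_K = [int(abs(math.sin(i + 1)) * 2**32) & 0xFFFFFFFF for i in range(64)]
_MASK = 0xFFFFFFFF


def _rotl(x, n):
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & _MASK


def _compress(regs, block):
    """Mix one 64-byte block into the four registers; returns new registers."""
    m = struct.unpack("<16I", block)
    a, b, c, d = regs
    for i in range(64):
        if i < 16:
            f, g = (b & c) | (~b & d), i
        elif i < 32:
            f, g = (d & b) | (~d & c), (5 * i + 1) % 16
        elif i < 48:
            f, g = b ^ c ^ d, (3 * i + 5) % 16
        else:
            f, g = c ^ (b | ~d), (7 * i) % 16
        f = (f + a + _K[i] + m[g]) & _MASK
        a, d, c = d, c, b
        b = (b + _rotl(f, _S[i])) & _MASK
    return [(r + v) & _MASK for r, v in zip(regs, (a, b, c, d))]


class ResumableMD5:
    """Hypothetical helper: an MD5 hasher whose state is a plain dict."""

    def __init__(self, state=None):
        if state is None:
            state = {"regs": [0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476],
                     "tail": "", "length": 0}
        self.regs = list(state["regs"])
        self.tail = bytes.fromhex(state["tail"])  # bytes not yet compressed
        self.length = state["length"]             # total bytes seen so far

    def update(self, data):
        self.length += len(data)
        buf = self.tail + data
        full = len(buf) - len(buf) % 64
        for off in range(0, full, 64):
            self.regs = _compress(self.regs, buf[off:off + 64])
        self.tail = buf[full:]

    def getstate(self):
        """Return a JSON-serializable snapshot of the hasher."""
        return {"regs": list(self.regs), "tail": self.tail.hex(),
                "length": self.length}

    def hexdigest(self):
        # Pad a copy of the state so hashing can still continue afterwards.
        pad = b"\x80" + b"\x00" * ((55 - self.length) % 64)
        bits = (self.length * 8) & 0xFFFFFFFFFFFFFFFF
        buf = self.tail + pad + struct.pack("<Q", bits)
        regs = self.regs
        for off in range(0, len(buf), 64):
            regs = _compress(regs, buf[off:off + 64])
        return struct.pack("<4I", *regs).hex()


# "ServerA" hashes the first chunk and ships the state as JSON.
m1 = ResumableMD5()
m1.update(b"hello, ")
wire = json.dumps(m1.getstate())

# "ServerB" restores the state and finishes the hash.
m2 = ResumableMD5(json.loads(wire))
m2.update(b"world!")
assert m2.hexdigest() == hashlib.md5(b"hello, world!").hexdigest()
```

A pure-Python loop like this is far slower than hashlib, so it is better suited to showing what has to be transferred than to hashing large uploads.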

As e.dan noted, you can also use almost any checksumming algorithm (CRC, Adler, Fletcher), but these protect you only against random errors, not against intentional modification of the data.

EDIT: Of course, you can also re-implement the ctypes-based serialization method from the thread you referenced in a more portable way (without magic constants). I believe this should be version- and architecture-independent (tested on Python 2.4-2.7, on both i386 and x86_64):

# based on idea from http://groups.google.com/group/comp.lang.python/msg/b1c5bb87a3ff5e34

try:
    import _md5 as md5
except ImportError:
    # python 2.4
    import md5
import ctypes

def md5_getstate(md):
    if type(md) is not md5.MD5Type:
        raise TypeError('not an MD5Type instance')
    # Read the raw bytes that follow the generic object header in memory.
    return ctypes.string_at(id(md) + object.__basicsize__,
                            md5.MD5Type.__basicsize__ - object.__basicsize__)

def md5_continue(state):
    md = md5.new()
    assert len(state) == md5.MD5Type.__basicsize__ - object.__basicsize__, \
           'invalid state'
    # Overwrite the fresh object's internal state with the saved bytes.
    ctypes.memmove(id(md) + object.__basicsize__,
                   ctypes.c_char_p(state),
                   len(state))
    return md

m1 = md5.new()
m1.update("hello, ")
state = md5_getstate(m1)
m2 = md5_continue(state)
m2.update("world!")
print(m2.hexdigest())

# Reference: hash the whole message in one go; both digests should match.
m = md5.new()
m.update("hello, world!")
print(m.hexdigest())

This code is not Python 3 compatible, since Python 3 does not have an _md5/md5 module with this interface.

Unfortunately, hashlib's openssl_md5 implementation is not suitable for such hacks, since the OpenSSL EVP API does not provide any calls/methods to reliably serialize EVP_MD_CTX objects.

眼泪也成诗 2024-11-12 01:30:38

This is theoretically possible (the MD5 state so far should contain everything you need to continue), but it looks like the normal APIs don't provide what you need. If a CRC is sufficient for your purposes, this will probably be a lot easier, since CRCs are more commonly used for "streaming" cases like yours. See here:

binascii.crc32(data[, crc])

crc32() accepts an optional crc input which is the checksum to continue from.
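
As a minimal sketch of that optional argument (assuming Python 3; the helper name crc32_resume and the server variable names are made up for illustration), the running CRC is just an integer, so it can be handed from one server to the other as-is:

```python
import binascii


def crc32_resume(chunk, crc=0):
    """Fold chunk into a running CRC-32; crc is the value carried over so far."""
    return binascii.crc32(chunk, crc) & 0xFFFFFFFF


# ServerA processes the first chunk and passes the integer on.
crc_on_server_a = crc32_resume(b"hello, ")

# ServerB continues from that integer with the remaining data.
crc_on_server_b = crc32_resume(b"world!", crc_on_server_a)

# Same result as checksumming the whole message at once.
assert crc_on_server_b == binascii.crc32(b"hello, world!") & 0xFFFFFFFF
print(format(crc_on_server_b, "08x"))
```

Keep in mind that a CRC only detects accidental corruption; it gives no protection against deliberate tampering.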

Hope that helps.

南笙 2024-11-12 01:30:38

I was facing this problem too, and found no existing solution, so I wrote a library that uses ctypes to deconstruct the OpenSSL data structure holding the hasher state: https://github.com/kislyuk/rehash. Example:

import pickle, rehash
hasher = rehash.sha256(b"foo")
state = pickle.dumps(hasher)

hasher2 = pickle.loads(state)
hasher2.update(b"bar")

assert hasher2.hexdigest() == rehash.sha256(b"foobar").hexdigest()

九八野马 2024-11-12 01:30:38

For those arriving here late, as I did, and using Python 3 (3.11 in my case): hashlib hash objects have an update() method.

import hashlib

a = hashlib.sha256(b"test")
b = hashlib.sha256(b"testtest")

print(a.hexdigest())
print(b.hexdigest())
a.update(b"test")  # a has now consumed b"testtest", the same input as b
print(a.hexdigest())

9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08
37268335dd6931045bdcdf92623ff819a64244b53d0e746d438797349d4da578
37268335dd6931045bdcdf92623ff819a64244b53d0e746d438797349d4da578
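
A short sketch of how update() is typically used on a stream (assuming Python 3; the helper name sha256_of_chunks is made up for illustration). Note that this continues a hash only while the hasher object lives in the same process. As far as I know, hashlib objects cannot be pickled, so update() on its own does not move the state to another server.

```python
import hashlib


def sha256_of_chunks(chunks):
    """Hash an iterable of byte chunks without joining them in memory."""
    hasher = hashlib.sha256()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()


# Feeding the data piecewise gives the same digest as hashing it in one call.
assert sha256_of_chunks([b"test", b"test"]) == hashlib.sha256(b"testtest").hexdigest()

# copy() takes an in-process snapshot, so an intermediate digest can be read
# while the original hasher keeps consuming data.
running = hashlib.sha256(b"test")
snapshot = running.copy()
running.update(b"test")
assert snapshot.hexdigest() == hashlib.sha256(b"test").hexdigest()
assert running.hexdigest() == hashlib.sha256(b"testtest").hexdigest()
```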
