Best way to identify and compare files in a client-server architecture
I have a DataStore made for the purpose of storing datasets. Users can upload datasets to the DataStore, but first they check whether the dataset already exists in the DataStore, and it only gets uploaded if the returned value is false. See sequence diagram:
This is implemented by identifying datasets by their checksums and comparing the client's checksum with the ones in the DataStore. The algorithm for now is CRC32. After some research it became clear that this can be unsafe because of the birthday problem: with CRC32, about 9,300 datasets are enough for a 1% probability of collision, and about 50,000 datasets for a 25% probability.
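The birthday-bound figures can be checked with a few lines of Python. This is a minimal sketch that assumes checksums behave like uniformly random values, which is only an approximation for CRC32; the function names are made up for illustration. For a 32-bit checksum it gives roughly 9,300 datasets for a 1% collision probability and roughly 50,000 for 25%.

```python
import math


def datasets_for_collision_probability(bits: int, p: float) -> float:
    """Approximate number of random `bits`-bit checksums needed before the
    probability of at least one collision reaches p (birthday approximation)."""
    space = 2 ** bits
    return math.sqrt(2 * space * math.log(1 / (1 - p)))


def collision_probability(bits: int, n: int) -> float:
    """Approximate probability of at least one collision among n random
    `bits`-bit checksums."""
    space = 2 ** bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-n * (n - 1) / (2 * space))


if __name__ == "__main__":
    for p in (0.01, 0.25):
        n = datasets_for_collision_probability(32, p)
        print(f"32 bits, p={p:.0%}: about {n:,.0f} datasets")
    # The same dataset count with a 64-bit checksum is far less risky.
    print(collision_probability(64, 50_000))
```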
These numbers suggest that CRC32 is very risky here. The checksum needs to be easy to calculate, so that it does not put too much load on the client. Is there a way (a tricky secondary check, maybe) to tell that datasets with matching checksums actually differ? Or is the only option to pick a function with more bits, based on the maximum number of datasets?
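One possible shape for such a secondary check is sketched below, purely as an illustration and not as the DataStore's actual design. It assumes the server keeps, for each cheap key (size plus CRC32), the set of stronger digests of the datasets stored under it. The client computes the stronger digest only when the cheap key matches. `DataStoreIndex`, `cheap_key` and `strong_digest` are hypothetical names, and SHA-256 is just one choice of stronger hash.

```python
import hashlib
import zlib


def cheap_key(data: bytes):
    """First-stage key: dataset size plus CRC32. Cheap, but collisions happen."""
    return (len(data), zlib.crc32(data))


def strong_digest(data: bytes) -> str:
    """Second-stage digest, computed only when the cheap key matches."""
    return hashlib.sha256(data).hexdigest()


class DataStoreIndex:
    """Hypothetical server-side index: cheap key -> set of strong digests."""

    def __init__(self):
        self._by_cheap_key = {}

    def add(self, data: bytes) -> None:
        self._by_cheap_key.setdefault(cheap_key(data), set()).add(strong_digest(data))

    def may_exist(self, key) -> bool:
        # False means "definitely not stored"; True means "check further".
        return key in self._by_cheap_key

    def exists(self, key, digest: str) -> bool:
        return digest in self._by_cheap_key.get(key, set())


def should_upload(index: DataStoreIndex, data: bytes) -> bool:
    key = cheap_key(data)
    if not index.may_exist(key):
        return True  # no candidate at all, so skip the expensive digest
    return not index.exists(key, strong_digest(data))
```

With this layout a CRC32 collision only costs one extra digest computation instead of a wrongly skipped upload. The remaining risk is a collision in the stronger digest itself.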
PS: I know that all kinds of questions about file comparison have been asked already, but I couldn't find one that answers all of my questions.
Comments (1)
Collisions can't be avoided entirely, so use a hash with more bits to minimize the probability of a collision.
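As a rough illustration of switching to a wider hash, the sketch below computes a 256-bit digest with Python's standard `hashlib`, reading the dataset in chunks so that large files do not need to fit in memory. The choice of SHA-256, the 1 MiB chunk size and the function names are assumptions for the example, not something the answer specifies.

```python
import hashlib


def digest_chunks(chunks, algorithm: str = "sha256") -> str:
    """Hash an iterable of byte chunks incrementally and return the hex digest."""
    h = hashlib.new(algorithm)
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def file_digest(path: str, algorithm: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunk_size pieces (1 MiB by default)."""
    with open(path, "rb") as f:
        # iter(callable, sentinel) keeps reading until f.read returns b"".
        return digest_chunks(iter(lambda: f.read(chunk_size), b""), algorithm)
```

The client would send the resulting hex string in place of the CRC32 value. How much extra load this puts on the client depends on the hardware and the dataset sizes, so it is worth measuring.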