How do I generate an MD5 hash for a file located at an HTTP URL?
I am writing a web crawler that searches for files and downloads them. My problem is that I do not want to download files that have already been downloaded to the local drive. I know it is possible to compare files using an MD5 hash, but how can I do this for a file at an HTTP URL without downloading it to the local disk?
If this approach is wrong, please advise on a better solution.
Comments (4)
Unless the web server offers some kind of service that publishes the MD5, no.
Computing a file hash requires every byte of the file. That is why changing a single byte changes the hash, which is what lets a hash detect modified files.
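To illustrate why every byte is needed, here is a minimal Python sketch using the standard hashlib module. The helper name md5_hex and the sample byte strings are made up for the illustration; they are not from the question.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a lowercase hex string."""
    return hashlib.md5(data).hexdigest()


# Two inputs that differ only in their final byte (illustrative data).
original = b"hello world"
modified = b"hello worle"

# The digests are unrelated even though a single byte changed, so the
# hash cannot be known without reading the whole file.
print(md5_hex(original))
print(md5_hex(modified))
print(md5_hex(original) == md5_hex(modified))
```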
To generate a hash you are going to need the data (i.e., you will need to download it somehow).
I would suggest that you investigate using the If-Modified-Since HTTP header instead (or maybe ETag/If-None-Match, if the particular server provides it).
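A conditional request along these lines could look roughly like the Python sketch below. It assumes the crawler stored the ETag and/or Last-Modified values from the response of the earlier download, and that the server honors conditional requests; the function names are hypothetical.

```python
import urllib.error
import urllib.request
from typing import Optional


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying the validators saved from a previous download."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url, etag=None, last_modified=None) -> Optional[bytes]:
    """Return the body if the server sends one, or None on 304 Not Modified."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; here it means "unchanged".
        if err.code == 304:
            return None
        raise
```

If the server ignores the validators it will simply return the full file again, so this only saves bandwidth where the server cooperates.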
The only comparison you will be able to perform on a remote file is a size comparison. Unfortunately, that is probably not enough to determine whether the contents are identical.
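A size check of this kind might be sketched in Python as follows, assuming the server answers HEAD requests and reports a Content-Length header (not all do). The function names are hypothetical. Note that the check is one-directional: different sizes prove the files differ, while equal sizes prove nothing.

```python
import os
import urllib.request


def remote_content_length(url):
    """Send a HEAD request and return Content-Length as an int, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        value = response.headers.get("Content-Length")
    return int(value) if value is not None else None


def definitely_different(local_path, remote_size):
    """True only when the remote size is known and differs from the local file's size."""
    if remote_size is None:
        return False  # size unknown, so nothing can be concluded
    return os.path.getsize(local_path) != remote_size
```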
Old question, but PowerShell 5+ can help get the MD5 of a file at a remote URL by downloading it as a stream of bytes and computing the MD5 in one step:
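The same idea, reading the response as a byte stream and hashing it in memory without writing anything to disk, can be sketched in Python. This is an illustration of the approach in a different language, not the PowerShell the answer refers to; the function names and the chunk size are assumptions. The file is still transferred over the network in full.

```python
import hashlib
import urllib.request


def md5_of_stream(stream, chunk_size=64 * 1024):
    """Hash a file-like object chunk by chunk and return the hex MD5 digest."""
    digest = hashlib.md5()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes means end of stream
            break
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_url(url):
    """Download `url` as a stream and hash it without saving it to disk."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return md5_of_stream(response)
```

The result of md5_of_url can then be compared against the MD5 of files already on the local drive.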