Read the contents of a zip file without fully downloading it
Is it possible to read the contents of a .ZIP file without fully downloading it?
I'm building a crawler and I'd rather not have to download every zip file just to index their contents.
Thanks!
5 Answers
The tricky part is identifying the start of the central directory, which sits at the end of the file. Since each entry is the same fixed size, you can do a kind of binary search starting from the end of the file. The binary search is trying to guess how many entries are in the central directory. Start with some reasonable value, N, and retrieve the portion of the file at end - (N * sizeof(DirectoryEntry)). If that file position does not start with the central directory entry signature, then N is too large: halve it and repeat. Otherwise, N is too small: double it and repeat. As in a binary search, the process maintains the current upper and lower bounds. When the two become equal, you've found the value of N, the number of entries.
The number of times you hit the webserver is at most 16, since there can be no more than 64K entries.
Whether this is more efficient than downloading the whole file depends on the file size. You might request the size of the resource before downloading, and if it's smaller than a given threshold, download the entire resource. For large resources, requesting multiple offsets will be quicker and, provided the threshold is set high enough, less taxing on the webserver overall.
HTTP/1.1 allows ranges of a resource to be downloaded. For HTTP/1.0 you have no choice but to download the whole file.
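As a rough illustration of that threshold-plus-ranges idea, here is a minimal Python sketch using the `requests` library. It assumes the server reports Content-Length and may honour HTTP/1.1 Range requests; the function name `fetch_tail` and the threshold value are mine, chosen only for the example.

```python
import requests

SMALL_FILE_THRESHOLD = 1 * 1024 * 1024  # 1 MiB; tune this for your crawler


def fetch_tail(url, tail_bytes=65536):
    """Return (total_size, bytes) - the tail of a large resource, or the whole file."""
    head = requests.head(url, allow_redirects=True)
    size = int(head.headers.get("Content-Length", 0))

    if size > SMALL_FILE_THRESHOLD:
        # Ask only for the last tail_bytes of the resource.
        resp = requests.get(url, headers={"Range": f"bytes=-{tail_bytes}"})
        if resp.status_code == 206:  # 206 Partial Content: the server honoured the range
            return size, resp.content

    # Small file, unknown size, or a server that ignores Range headers
    # (e.g. HTTP/1.0): fall back to downloading the whole resource.
    resp = requests.get(url)
    return len(resp.content), resp.content
```

The 206 check matters because some servers silently ignore the Range header and return the full body with a 200 status.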
The format suggests that the key piece of information about what's in the file resides at the end of it. Entries are then specified as offsets from that record, so you'll need access to the whole thing, I believe.
GZip-format files, on the other hand, can be read as a stream, I believe.
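To make the streaming point concrete, here is a small sketch (my own illustration, not from the answer) that decompresses a remote .gz file incrementally as bytes arrive, using Python's `zlib` and `requests`:

```python
import zlib
import requests


def stream_gunzip(url, chunk_size=64 * 1024):
    """Yield decompressed chunks of a remote .gz file without buffering it all."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size):
            yield decomp.decompress(chunk)
        yield decomp.flush()
```

This works because gzip is a pure forward stream; a zip archive's central directory at the end is what forces the seeking described in the other answers.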
It's possible. All you need is a server that allows bytes to be read in ranges: fetch the end-of-central-directory record (to learn the size of the CD), fetch the central directory (to learn where each file starts and ends), and then fetch the right bytes and handle them.
Here is an implementation in Python: onlinezip
[full disclosure: I'm the author of the library]
There is a solution implemented in ArchView
"ArchView can open archive file online without downloading the whole archive."
https://addons.mozilla.org/en-US/firefox/addon/5028/
Inside archview-0.7.1.xpi, in the file "archview.js", you can look at their JavaScript approach.
I don't know if this helps, as I'm not a programmer. But in Outlook you can preview zip files and see the actual content, not just the file directory (if they are previewable documents, such as a PDF).