Often a web service needs to zip up several large files for download by the client. The most obvious way to do this is to create a temporary zip file, then either echo it to the user or save it to disk and redirect (deleting it some time in the future).
However, doing things that way has drawbacks:
- an initial phase of intensive CPU and disk thrashing, resulting in...
- a considerable initial delay to the user while the archive is prepared
- very high memory footprint per request
- use of substantial temporary disk space
- if the user cancels the download half way through, all resources used in the initial phase (CPU, memory, disk) will have been wasted
Solutions like ZipStream-PHP improve on this by shovelling the data into Apache file by file. However, the result is still high memory usage (files are loaded entirely into memory), and large, thrashy spikes in disk and CPU usage.
In contrast, consider the following bash snippet:
ls -1 | zip -@ - | cat > file.zip
# Note -@ is not supported on MacOS
Here, zip operates in streaming mode, resulting in a low memory footprint. A pipe has an integral buffer – when the buffer is full, the OS suspends the writing program (the program on the left of the pipe). This ensures that zip works only as fast as its output can be written by cat.

The optimal way, then, would be to do the same: replace cat with a web server process, streaming the zip file to the user as it is created on the fly. This would create little overhead compared to just streaming the files, and would have an unproblematic, non-spiky resource profile.
How can you achieve this on a LAMP stack?
7 Answers
You can use popen() (docs) or proc_open() (docs) to execute a unix command (e.g. zip or gzip), and get back stdout as a php stream. flush() (docs) will do its very best to push the contents of php's output buffer to the browser.

Combining all of this will give you what you want (provided that nothing else gets in the way -- see esp. the caveats on the docs page for flush()).

(Note: don't use flush(). See the update below for details.)

Something like the following can do the trick:
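A minimal sketch of such a script (the filenames are placeholders, and the tar|gzip pipeline is one possible choice; a revised ZIP version appears further down):

<?php
// Stream a gzipped tarball as it is produced, instead of building it first.
header('Content-Type: application/x-gzip');

// popen() hands us the pipeline's stdout as a regular PHP stream.
$fp = popen('tar cf - file1 file2 file3 | gzip -c', 'r');

// Read in chunks; fread() may return fewer bytes than requested.
while (!feof($fp)) {
    echo fread($fp, 8192);
}
pclose($fp);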
You asked about "other technologies": to which I'll say, "anything that supports non-blocking i/o for the entire lifecycle of the request". You could build such a component as a stand-alone server in Java or C/C++ (or any of many other available languages), if you were willing to get into the "down and dirty" of non-blocking file access and whatnot.
If you want a non-blocking implementation, but you would rather avoid the "down and dirty", the easiest path (IMHO) would be to use nodeJS. There is plenty of support for all the features you need in the existing release of nodejs: use the http module (of course) for the http server, and use the child_process module to spawn the tar/zip/whatever pipeline.

Finally, if (and only if) you're running a multi-processor (or multi-core) server, and you want the most from nodejs, you can use Spark2 to run multiple instances on the same port. Don't run more than one nodejs instance per processor core.
Update (from Benji's excellent feedback in the comments section on this answer)
1. The docs for fread() indicate that the function will read only up to 8192 bytes of data at a time from anything that is not a regular file. Therefore, 8192 may be a good choice of buffer size.

[editorial note] 8192 is almost certainly a platform-dependent value -- on most platforms, fread() will read data until the operating system's internal buffer is empty, at which point it will return, allowing the OS to fill the buffer again asynchronously. 8192 is the size of the default buffer on many popular operating systems. There are other circumstances that can cause fread to return even less than 8192 bytes -- for example, when the "remote" client (or process) is slow to fill the buffer; in most cases, fread() will return the contents of the input buffer as-is without waiting for it to get full. This could mean anywhere from 0 to os_buffer_size bytes get returned. The moral is: the value you pass to fread() as buffsize should be considered a "maximum" size -- never assume that you've received the number of bytes you asked for (or any other number for that matter).

2. According to comments on the fread docs, a few caveats: magic quotes may interfere and must be turned off.
3. Setting mb_http_output('pass') (docs) may be a good idea. Though 'pass' is already the default setting, you may need to specify it explicitly if your code or config has previously changed it to something else.

4. If you're creating a zip (as opposed to gzip), you'd want to use the content type header:
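In PHP:

header('Content-Type: application/zip');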
or... 'application/octet-stream' can be used instead. (it's a generic content type used for binary downloads of all different kinds):
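That is:

header('Content-Type: application/octet-stream');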
and if you want the user to be prompted to download and save the file to disk (rather than potentially having the browser try to display the file as text), then you'll need the content-disposition header. (where filename indicates the name that should be suggested in the save dialog):
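For example (the filename is only a suggestion to the browser):

header('Content-Disposition: attachment; filename="file.zip"');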
One should also send the Content-length header, but this is hard with this technique as you don’t know the zip’s exact size in advance. Is there a header that can be set to indicate that the content is "streaming" or is of unknown length? Does anybody know?
Finally, here's a revised example that uses all of @Benji's suggestions (and that creates a ZIP file instead of a TAR.GZIP file):
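A sketch along those lines (placeholder filenames; 8192-byte maximum reads per point 1; and no flush() calls, per the update below):

<?php
// Headers from point 4 above.
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="file.zip"');

// zip -r - writes the archive to stdout; file1/file2/file3 are placeholders.
$fp = popen('zip -r - file1 file2 file3', 'r');

// 8192 is a *maximum* read size -- fread() may return less (see point 1).
while (!feof($fp)) {
    echo fread($fp, 8192);
}
pclose($fp);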
Update: (2012-11-23) I have discovered that calling flush() within the read/echo loop can cause problems when working with very large files and/or very slow networks. At least, this is true when running PHP as cgi/fastcgi behind Apache, and it seems likely that the same problem would occur when running in other configurations too. The problem appears to result when PHP flushes output to Apache faster than Apache can actually send it over the socket. For very large files (or slow connections), this eventually causes an overrun of Apache's internal output buffer. This causes Apache to kill the PHP process, which of course causes the download to hang, or complete prematurely, with only a partial transfer having taken place.

The solution is not to call flush() at all. I have updated the code examples above to reflect this, and I placed a note in the text at the top of the answer.
Another solution is my mod_zip module for Nginx, written specifically for this purpose:
https://github.com/evanmiller/mod_zip
It is extremely lightweight and does not invoke a separate "zip" process or communicate via pipes. You simply point to a script that lists the locations of files to be included, and mod_zip does the rest.
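For reference, a hypothetical upstream script (the line format -- CRC-32 or "-", size in bytes, location, archive name -- follows mod_zip's README; the paths and names here are invented):

<?php
// Tell Nginx/mod_zip to assemble the listed files into a zip.
header('X-Archive-Files: zip');
header('Content-Disposition: attachment; filename="download.zip"');

// location as Nginx sees it => name inside the archive
$files = array(
    '/protected/file1.pdf' => 'file1.pdf',
    '/protected/file2.pdf' => 'file2.pdf',
);
foreach ($files as $location => $name) {
    $size = filesize('/var/www' . $location); // assumes files live under /var/www
    echo "- $size $location $name\n";         // "-" marks the CRC-32 as unknown
}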
Trying to implement a dynamically generated download with lots of files of different sizes, I came across this solution, but I ran into various memory errors like "Allowed memory size of 134217728 bytes exhausted at ...".

After adding ob_flush(); right before the flush(); the memory errors disappear.

Together with sending the headers, my final solution looks like this (just storing the files inside the zip without directory structure):
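A sketch of that final solution (paths are placeholders; -j "junks" the paths so files are stored without directory structure):

<?php
header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="download.zip"');

// -j stores the files without their directory components.
$fp = popen('zip -j - /path/to/file1.pdf /path/to/file2.pdf', 'r');

while (!feof($fp)) {
    echo fread($fp, 8192);
    ob_flush(); // empty PHP's output buffer first...
    flush();    // ...then push the data toward the client
}
pclose($fp);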
I wrote this S3 streaming file zipper microservice last weekend - might be useful: http://engineroom.teamwork.com/how-to-securely-provide-a-zip-download-of-a-s3-file-bundle/
According to the PHP manual, the ZIP extension provides a zip: wrapper.
I have never used it and I don't know its internals, but logically it should be able to do what you're looking for, assuming that ZIP archives can be streamed, which I'm not entirely sure of.
As for your question about the "LAMP stack", it shouldn't be a problem as long as PHP is not configured to buffer output.

Edit: I'm trying to put a proof-of-concept together, but it seems non-trivial. If you're not experienced with PHP's streams, it might prove too complicated, if it's even possible.

Edit (2): rereading your question after taking a look at ZipStream, I found what's going to be your main problem here when you say (emphasis added)
That part will be extremely hard to implement because I don't think PHP provides a way to determine how full Apache's buffer is. So, the answer to your question is no, you probably won't be able to do that in PHP.
It seems you can eliminate any output-buffer related problems by using fpassthru(). I also use -0 to save CPU time since my data is compact already. I use this code to serve a whole folder, zipped on-the-fly:
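A sketch of such a script (the folder path and download name are placeholders):

<?php
// Serve a whole folder as a stored (uncompressed) zip, streamed on the fly.
$dir = '/path/to/folder'; // placeholder

header('Content-Type: application/octet-stream');
header('Content-Disposition: attachment; filename="folder.zip"');

// -r recurses into the directory; -0 stores entries without compression.
$fp = popen('cd ' . escapeshellarg($dir) . ' && zip -0 -r - .', 'r');

// fpassthru() copies the remainder of the stream straight to the output.
fpassthru($fp);
pclose($fp);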
I just released a ZipStreamWriter class written in pure PHP userland here:
https://github.com/cubiclesoft/php-zipstreamwriter
Instead of using external applications (e.g. zip) or extensions like ZipArchive, it supports streaming data into and out of the class by implementing a full-blown ZIP writer.
How the streaming aspect works is by using the ZIP file format's "Data Descriptors", as described by section 4.3.5 of the PKWARE ZIP file specification.
There are some possible limitations to be aware of, though. Not every tool can read streaming ZIP files. Also, Zip64 streaming ZIP files may have even less support, but that's only a concern for files over 2GB with this class. However, both 7-Zip and the Windows 10 built-in ZIP file reader seem to be fine with handling all of the crazy files that the ZipStreamWriter class threw at them. The hex editor I use got a good workout too.
When using the ZipStreamWriter class, I recommend allowing a buffer to build up to at least 4KB but no more than 65KB at a time before sending it on to the web server. Otherwise, for lots of really tiny files, you'll be flushing out tiny bits of piecemeal data and waste a bunch of extra CPU cycles on the Apache callback end of things.
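The ZipStreamWriter API itself isn't shown here, so purely as an illustration of that buffering strategy (the 4KB threshold comes from the advice above; the function name is invented):

<?php
$buffer = '';

// Hypothetical output callback: accumulate zip data and only hand it to
// the web server once at least 4KB is ready (or at the very end).
function QueueZipData($data, $final = false)
{
    global $buffer;
    $buffer .= $data;
    if ($final || strlen($buffer) >= 4096) {
        echo $buffer;
        $buffer = '';
    }
}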
When something doesn't exist or I don't like the existing options, I find both official and unofficial specifications, some examples to work with, and then I build it from scratch. It's a fairly solid approach to problem solving, if just a tad overkill.