Retrieving part of a web page
Is there any way of limiting the amount of data CURL will fetch? I'm screen-scraping data off a page that is 50kb, but the data I require is in the top quarter of the page, so I really only need to retrieve the first 10kb.
I'm asking because there is a lot of data I need to monitor, which results in me transferring close to 60GB of data per month, when only about 5GB of that bandwidth is relevant.
I am using PHP to process the data, but I am flexible in my retrieval approach: I can use CURL, WGET, fopen, etc.
One approach I'm considering is:
$fp = fopen("http://www.website.com","r");
fseek($fp,5000);
$data_to_parse = fread($fp,6000);
Does the above mean I will only transfer 6kb from www.website.com, or will fopen load all of www.website.com into memory, meaning I will still transfer the full 50kb?
This is more an HTTP question than a CURL question, in fact.
As you guessed, the whole page is going to be downloaded if you use fopen, no matter whether you seek to offset 5000 first.
The best way to achieve what you want is to use a partial HTTP GET request, as specified in the HTTP RFC (http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html).
The details of partial GET requests using Ranges are described here:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.2
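A partial GET can be issued from PHP's cURL extension via CURLOPT_RANGE. This is a sketch only (www.website.com is the placeholder URL from the question), and servers are free to ignore the Range header and send the full body, so the status code should be checked:

```php
<?php
// Sketch: request only the first 10 KB of the page with an HTTP Range header.
$ch = curl_init("http://www.website.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_RANGE, "0-10239"); // bytes 0..10239 = first 10 KB

$data   = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

if ($status === 206) {
    // Server honoured the range: $data holds at most 10 KB.
    $data_to_parse = $data;
} elseif ($status === 200) {
    // Server ignored the range and sent the whole page; trim locally.
    $data_to_parse = substr($data, 0, 10240);
}
```

When the server returns 206, only the requested bytes cross the wire, which is exactly the bandwidth saving the question is after.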
Try an HTTP Range request:
If the server supports range requests, it will return a 206 Partial Content response code with a Content-Range header and your requested range of bytes (if it doesn't, it will return 200 and the whole file). See http://benramsey.com/archives/206-partial-content-and-range-requests/ for a good explanation of range requests.
See also Resumable downloads when using PHP to send the file?.
You may also be able to accomplish what you're looking for with CURL.
If you look at the documentation for CURLOPT_WRITEFUNCTION, you can register a callback that is called whenever data is available to be read from CURL. You could then count the bytes received, and once you have received more than 6,000 bytes, return 0 to abort the rest of the transfer.
The libcurl documentation describes the callback in more detail.
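A minimal sketch of that callback approach in PHP (again using the question's placeholder URL): the write callback accumulates data and aborts the transfer by returning a byte count that differs from the chunk length.

```php
<?php
// Sketch: abort the transfer once roughly 6 KB have arrived.
$limit    = 6000;
$received = '';

$ch = curl_init("http://www.website.com/");
curl_setopt($ch, CURLOPT_WRITEFUNCTION, function ($ch, $chunk) use (&$received, $limit) {
    $received .= $chunk;
    if (strlen($received) >= $limit) {
        return 0; // returning a count != strlen($chunk) makes libcurl abort
    }
    return strlen($chunk); // tell libcurl the whole chunk was consumed
});

curl_exec($ch); // reports CURLE_WRITE_ERROR (23) when aborted early
curl_close($ch);

$data_to_parse = substr($received, 0, $limit);
```

Note that libcurl may already have read slightly more than the limit off the socket before the abort takes effect, but the remainder of the page is not transferred.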
It will download the whole page with the fopen call, but then it will only read 6kb from that page. The PHP manual notes that HTTP streams are not truly seekable, so fseek on such a stream cannot skip the transfer of the earlier bytes.