Retrieving part of a web page

Published on 2024-08-06 15:04:43

Is there any way of limiting the amount of data cURL will fetch? I'm screen-scraping data off a page that is 50 kB, but the data I need is in the top quarter of the page, so I really only need to retrieve the first 10 kB.

I'm asking because there is a lot of data I need to monitor, which results in me transferring close to 60 GB of data per month, when only about 5 GB of that bandwidth is relevant.

I am using PHP to process the data, but I am flexible in my retrieval approach; I can use cURL, wget, fopen, etc.

One approach I'm considering is:

$fp = fopen("http://www.website.com","r");
fseek($fp,5000);
$data_to_parse = fread($fp,6000);

Does the above mean I will only transfer 6 kB from www.website.com, or will fopen load www.website.com into memory, meaning I will still transfer the full 50 kB?


Comments (4)

梦魇绽荼蘼 2024-08-13 15:04:43

This is more an HTTP question than a cURL question, in fact.

As you guessed, the whole page is going to be downloaded if you use fopen, no matter whether you seek to offset 5000 or not.

The best way to achieve what you want would be to use a partial HTTP GET request, as described in the HTTP RFC (http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html):

The semantics of the GET method change to a "partial GET" if the request message includes a Range header field. A partial GET requests that only part of the entity be transferred, as described in section 14.35. The partial GET method is intended to reduce unnecessary network usage by allowing partially-retrieved entities to be completed without transferring data already held by the client.

The details of partial GET requests using Ranges are described here:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.2
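A minimal PHP sketch of such a partial GET using cURL's CURLOPT_RANGE option. The URL is the question's placeholder, `clip_partial_response` and `fetch_first_bytes` are made-up helper names, and the fallback exists because some servers ignore the Range header and answer 200 with the full body:

```php
<?php
// Keep at most $limit bytes of a response. A 206 means the server
// already trimmed the body for us; a 200 means it ignored the Range
// header and sent everything, so we trim locally.
function clip_partial_response(int $status, string $body, int $limit): string
{
    if ($status === 206) {
        return $body;
    }
    return substr($body, 0, $limit);
}

// Hypothetical helper: request only the first $limit bytes of $url.
function fetch_first_bytes(string $url, int $limit): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_RANGE, '0-' . ($limit - 1)); // e.g. bytes 0-10239
    $body = (string) curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return clip_partial_response($status, $body, $limit);
}

// Usage: $data_to_parse = fetch_first_bytes('http://www.website.com', 10240);
```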

征棹 2024-08-13 15:04:43

Try an HTTP Range request:

GET /largefile.html HTTP/1.1
Range: bytes=0-6000

If the server supports range requests, it will return a 206 Partial Content response code with a Content-Range header and your requested range of bytes (if it doesn't, it will return 200 and the whole file). See http://benramsey.com/archives/206-partial-content-and-range-requests/ for a nice explanation of range requests.

See also the question "Resumable downloads when using PHP to send the file?".
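A hedged sketch of the same request using PHP's HTTP stream wrapper instead of cURL; the host and largefile.html come from the answer above, while `fetch_range` and `is_partial_content` are hypothetical names:

```php
<?php
// Check the status line PHP records in $http_response_header, e.g.
// "HTTP/1.1 206 Partial Content".
function is_partial_content(array $response_headers): bool
{
    return isset($response_headers[0])
        && strpos($response_headers[0], ' 206 ') !== false;
}

// Request bytes $first..$last of $url; returns null on failure.
function fetch_range(string $url, int $first, int $last): ?string
{
    $context = stream_context_create([
        'http' => ['header' => sprintf("Range: bytes=%d-%d\r\n", $first, $last)],
    ]);
    $body = file_get_contents($url, false, $context);
    if ($body === false) {
        return null;
    }
    // A 200 here means the server ignored the range and sent the whole
    // file, so clip out the requested slice locally.
    return is_partial_content($http_response_header)
        ? $body
        : substr($body, $first, $last - $first + 1);
}

// Usage: $chunk = fetch_range('http://www.website.com/largefile.html', 0, 5999);
```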

幻梦 2024-08-13 15:04:43

You may be able to accomplish what you're looking for using cURL as well.

If you look at the documentation for CURLOPT_WRITEFUNCTION, you can register a callback that is called whenever data is available for reading from cURL. You can then count the bytes received, and once you've received over 6,000 bytes you can return 0 to abort the rest of the transfer.

The libcurl documentation describes the callback a bit more:

This function gets called by libcurl as soon as there is data received that needs to be saved. Return the number of bytes actually taken care of. If that amount differs from the amount passed to your function, it'll signal an error to the library and it will abort the transfer and return CURLE_WRITE_ERROR.

The callback function will be passed as much data as possible in all invokes, but you cannot possibly make any assumptions. It may be one byte, it may be thousands.
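A sketch of that write-callback approach. The 6,000-byte threshold matches the question, `make_limited_writer` is a made-up helper name, and the aborted transfer is expected to make curl_exec() return false rather than signal a real failure:

```php
<?php
// Build a CURLOPT_WRITEFUNCTION callback that appends each chunk to
// $buffer and aborts the transfer (by returning 0, which libcurl
// treats as CURLE_WRITE_ERROR) once at least $limit bytes have arrived.
function make_limited_writer(int $limit, string &$buffer): callable
{
    return function ($ch, string $chunk) use ($limit, &$buffer): int {
        $buffer .= $chunk;
        if (strlen($buffer) >= $limit) {
            return 0;              // abort the rest of the transfer
        }
        return strlen($chunk);     // chunk fully handled, keep going
    };
}

// Usage (URL is the question's placeholder):
// $buffer = '';
// $ch = curl_init('http://www.website.com');
// curl_setopt($ch, CURLOPT_WRITEFUNCTION, make_limited_writer(6000, $buffer));
// curl_exec($ch);   // returns false once aborted -- expected here
// curl_close($ch);
// $data_to_parse = substr($buffer, 0, 6000);
```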

雄赳赳气昂昂 2024-08-13 15:04:43

It will download the whole page with the fopen call, but then it will only read 6 kB from that page.

From the PHP manual:

Reading stops as soon as one of the following conditions is met:

  • length bytes have been read
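The read side of that can be sketched in isolation with an in-memory stream standing in for the 50 kB page; with a real http:// stream opened via fopen(), the whole page would still travel over the network even though only 6,000 bytes reach the variable:

```php
<?php
// fread() returns at most its length argument, whatever the stream holds.
$fp = fopen('php://memory', 'r+');
fwrite($fp, str_repeat('x', 50000)); // stand-in for the 50 kB page
rewind($fp);
$data_to_parse = fread($fp, 6000);   // stops after 6000 bytes
fclose($fp);
```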