Retrieving part of a web page

Published on 2024-08-06 15:04:43

Is there any way of limiting the amount of data cURL will fetch? I'm screen-scraping data off a page that is 50 kB, but the data I need is in the top quarter of the page, so I really only need to retrieve the first 10 kB.

I'm asking because there is a lot of data I need to monitor, which results in me transferring close to 60 GB of data per month, when only about 5 GB of that bandwidth is relevant.

I am using PHP to process the data, but I am flexible in my retrieval approach; I can use cURL, wget, fopen, etc.

One approach I'm considering is:

$fp = fopen("http://www.website.com","r");
fseek($fp,5000);
$data_to_parse = fread($fp,6000);

Does the above mean I will only transfer 6 kB from www.website.com, or will fopen load www.website.com into memory, meaning I will still transfer the full 50 kB?


Comments (4)

梦魇绽荼蘼 2024-08-13 15:04:43

This is more an HTTP question than a cURL question, in fact.

As you guessed, the whole page is going to be downloaded if you use fopen, no matter whether you seek to offset 5000 or not.

The best way to achieve what you want would be to use a partial HTTP GET request, as described in the HTTP RFC (http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html):

The semantics of the GET method change to a "partial GET" if the request message includes a Range header field. A partial GET requests that only part of the entity be transferred, as described in section 14.35. The partial GET method is intended to reduce unnecessary network usage by allowing partially-retrieved entities to be completed without transferring data already held by the client.

The details of partial GET requests using Ranges are described here:
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.2
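A minimal PHP sketch of such a partial GET using cURL's CURLOPT_RANGE option. The URL is the question's placeholder, `clip_partial_response` and `fetch_first_bytes` are made-up helper names, and the fallback exists because some servers ignore the Range header and answer 200 with the full body:

```php
<?php
// Keep at most $limit bytes of a response. A 206 means the server
// already trimmed the body for us; a 200 means it ignored the Range
// header and sent everything, so we trim locally.
function clip_partial_response(int $status, string $body, int $limit): string
{
    if ($status === 206) {
        return $body;
    }
    return substr($body, 0, $limit);
}

// Hypothetical helper: request only the first $limit bytes of $url.
function fetch_first_bytes(string $url, int $limit): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_RANGE, '0-' . ($limit - 1)); // e.g. bytes 0-10239
    $body = (string) curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return clip_partial_response($status, $body, $limit);
}

// Usage: $data_to_parse = fetch_first_bytes('http://www.website.com', 10240);
```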

征棹 2024-08-13 15:04:43

Try an HTTP Range request:

GET /largefile.html HTTP/1.1
Range: bytes=0-6000

If the server supports range requests, it will return a 206 Partial Content response code with a Content-Range header and your requested range of bytes (if it doesn't, it will return 200 and the whole file). See http://benramsey.com/archives/206-partial-content-and-range-requests/ for a nice explanation of range requests.

See also the question "Resumable downloads when using PHP to send the file?".
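A hedged sketch of the same request using PHP's HTTP stream wrapper instead of cURL; the host and largefile.html come from the answer above, while `fetch_range` and `is_partial_content` are hypothetical names:

```php
<?php
// Check the status line PHP records in $http_response_header, e.g.
// "HTTP/1.1 206 Partial Content".
function is_partial_content(array $response_headers): bool
{
    return isset($response_headers[0])
        && strpos($response_headers[0], ' 206 ') !== false;
}

// Request bytes $first..$last of $url; returns null on failure.
function fetch_range(string $url, int $first, int $last): ?string
{
    $context = stream_context_create([
        'http' => ['header' => sprintf("Range: bytes=%d-%d\r\n", $first, $last)],
    ]);
    $body = file_get_contents($url, false, $context);
    if ($body === false) {
        return null;
    }
    // A 200 here means the server ignored the range and sent the whole
    // file, so clip out the requested slice locally.
    return is_partial_content($http_response_header)
        ? $body
        : substr($body, $first, $last - $first + 1);
}

// Usage: $chunk = fetch_range('http://www.website.com/largefile.html', 0, 5999);
```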

幻梦 2024-08-13 15:04:43

You may be able to accomplish what you're looking for using cURL as well.

If you look at the documentation for CURLOPT_WRITEFUNCTION, you can register a callback that is called whenever data is available for reading from cURL. You can then count the bytes received, and once you've received over 6,000 bytes you can return 0 to abort the rest of the transfer.

The libcurl documentation describes the callback a bit more:

This function gets called by libcurl as soon as there is data received that needs to be saved. Return the number of bytes actually taken care of. If that amount differs from the amount passed to your function, it'll signal an error to the library and it will abort the transfer and return CURLE_WRITE_ERROR.

The callback function will be passed as much data as possible in all invokes, but you cannot possibly make any assumptions. It may be one byte, it may be thousands.
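A sketch of that write-callback approach. The 6,000-byte threshold matches the question, `make_limited_writer` is a made-up helper name, and the aborted transfer is expected to make curl_exec() return false rather than signal a real failure:

```php
<?php
// Build a CURLOPT_WRITEFUNCTION callback that appends each chunk to
// $buffer and aborts the transfer (by returning 0, which libcurl
// treats as CURLE_WRITE_ERROR) once at least $limit bytes have arrived.
function make_limited_writer(int $limit, string &$buffer): callable
{
    return function ($ch, string $chunk) use ($limit, &$buffer): int {
        $buffer .= $chunk;
        if (strlen($buffer) >= $limit) {
            return 0;              // abort the rest of the transfer
        }
        return strlen($chunk);     // chunk fully handled, keep going
    };
}

// Usage (URL is the question's placeholder):
// $buffer = '';
// $ch = curl_init('http://www.website.com');
// curl_setopt($ch, CURLOPT_WRITEFUNCTION, make_limited_writer(6000, $buffer));
// curl_exec($ch);   // returns false once aborted -- expected here
// curl_close($ch);
// $data_to_parse = substr($buffer, 0, 6000);
```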

雄赳赳气昂昂 2024-08-13 15:04:43

It will download the whole page with the fopen call, but then it will only read 6 kB from that page.

From the PHP manual:

Reading stops as soon as one of the following conditions is met:

  • length bytes have been read
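The read side of that can be sketched in isolation with an in-memory stream standing in for the 50 kB page; with a real http:// stream opened via fopen(), the whole page would still travel over the network even though only 6,000 bytes reach the variable:

```php
<?php
// fread() returns at most its length argument, whatever the stream holds.
$fp = fopen('php://memory', 'r+');
fwrite($fp, str_repeat('x', 50000)); // stand-in for the 50 kB page
rewind($fp);
$data_to_parse = fread($fp, 6000);   // stops after 6000 bytes
fclose($fp);
```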