Tracking page headers and redirects with php-libcurl

Posted on 2024-11-08 19:06:56

I am writing a script to track headers, especially redirects and cookies, for a URL.
Often when I open a URL it redirects to another URL, sometimes through several, and also stores some cookies. But when I ran the script with the URL

http://en.wikipedia.org/

my script showed only one redirect and did not store any cookies. When I browsed the URL in Firefox, however, it saved cookies, and when I inspected it with Live HTTP Headers it showed multiple GET requests. Live HTTP Headers also shows that there are Set-Cookie headers.

<?php

$url = "http://en.wikipedia.org/";
$userAgent = "Mozilla/5.0 (Windows NT 5.1; rv:2.0) Gecko/20100101 Firefox/4.0";
$encoding = "gzip, deflate";
// CURLOPT_HTTPHEADER expects a list of "Name: value" strings, not an associative array
$header = array(
    "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language: en-us,en;q=0.5",
    "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Connection: keep-alive",
    "Keep-Alive: 115",
);
$cookieFile = dirname(__FILE__) . "/cookie.txt";
$pageHeader = array();
$location = array();
$cookie = array();
$i = 1;
$flag = 1;        // 0 if there is no redirect, i.e. no Location header to follow; controls the while loop below

while ($flag != 0 && $i <= 20) {        // the cap of 20 hops is an added guard against redirect loops
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($ch, CURLOPT_ENCODING, $encoding);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_NOBODY, 1);        // note: this makes curl send a HEAD request, not a GET
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    $pageHeader[$i] = curl_exec($ch);
    curl_close($ch);
    // \S+ stops before the trailing "\r\n"; /m anchors at line starts, /i accepts lower-case header names
    $flag = preg_match('/^Location:\s*(\S+)/mi', $pageHeader[$i], $match);
    if ($flag == 1) {      // if there is a Location header
        if (preg_match('@^https?://@i', $match[1]) == 1) {      // if it is an absolute URL
            $url = $match[1];
        } elseif (preg_match('@^/@', $match[1]) == 1) {   // if the URL is relative to the server's root
            preg_match('@^https?://[^/]+@i', $url, $domain);
            $url = $domain[0] . $match[1];        // $domain is an array; the matched text is in $domain[0]
        } else {        // if the URL is relative to the current directory (assumes $url has a path)
            $url = preg_replace('@/[^/]*$@', "/" . $match[1], $url);
        }
        $location[$i] = $url;
        preg_match('/^Set-Cookie:\s*([^\r\n]*)/mi', $pageHeader[$i], $cookie[$i]);
        $i++;
    }
}

$loc = "";
foreach ($location as $l) {
    $loc .= $l . "\n";
}

$headerDump = implode("\n\n\n", $pageHeader);
file_put_contents(dirname(__FILE__) . "/location.txt", $loc);
file_put_contents(dirname(__FILE__) . "/header.txt", $headerDump);
?>

Here the files location.txt and header.txt are created, but cookie.txt is not.
If I change the URL to google.com, it shows the redirect to google.co.in in location.txt and saves one cookie in cookie.txt. But when I open google.com in Firefox, it saves three cookies. What could be wrong?
I think some JavaScript on the page is setting the cookies, so curl is not able to get them.
Any suggestions for improving the code above are also welcome.
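
One possible simplification, offered as a sketch rather than a definitive fix: let libcurl follow the redirects itself and collect the headers of every hop through CURLOPT_HEADERFUNCTION, then pick out the Location and Set-Cookie lines afterwards. The function names summarizeHeaders and traceRedirects are hypothetical and not part of the original script. The sketch also drops CURLOPT_NOBODY so that GET requests are sent, as a browser would. Keep in mind that curl does not execute JavaScript and does not fetch images or scripts referenced by the page, so cookies set that way in Firefox will not show up here.

```php
<?php
// Hypothetical helper: split a raw header dump (one or more response header
// blocks) into per-response status lines, Location values and Set-Cookie values.
function summarizeHeaders(string $raw): array
{
    $hops = [];
    // Header blocks of successive responses are separated by a blank line.
    foreach (preg_split('/\r?\n\r?\n/', trim($raw)) as $block) {
        if ($block === '') {
            continue;
        }
        $hop = ['status' => strtok($block, "\r\n"), 'location' => null, 'cookies' => []];
        foreach (preg_split('/\r?\n/', $block) as $line) {
            if (preg_match('/^Location:\s*(.+)$/i', $line, $m)) {
                $hop['location'] = trim($m[1]);
            } elseif (preg_match('/^Set-Cookie:\s*(.+)$/i', $line, $m)) {
                $hop['cookies'][] = trim($m[1]);
            }
        }
        $hops[] = $hop;
    }
    return $hops;
}

// Hypothetical wrapper: let libcurl follow the redirects and return the
// headers of every hop. Needs the curl extension and network access.
function traceRedirects(string $url, string $cookieFile, int $maxHops = 10): array
{
    $raw = '';
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => $maxHops,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_COOKIEFILE     => $cookieFile,
        // Called once per header line, for every response in the chain.
        CURLOPT_HEADERFUNCTION => function ($ch, $line) use (&$raw) {
            $raw .= $line;
            return strlen($line);
        },
    ]);
    curl_exec($ch);
    curl_close($ch);
    return summarizeHeaders($raw);
}
```

Calling traceRedirects("http://en.wikipedia.org/", dirname(__FILE__) . "/cookie.txt") would return one entry per response in the redirect chain, which can then be written to location.txt and header.txt as in the original script.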

Comments (1)

记忆里有你的影子 2024-11-15 19:06:56

Your code for following the Location: header is completely broken. As you should have seen, most HTTP redirects are relative, so you can't just use that string as the URL in the subsequent request.
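
To illustrate the point about relative redirects, here is a minimal sketch of resolving a Location value against the URL that returned it. The function name resolveRedirect is hypothetical; it handles absolute, protocol-relative, root-relative and directory-relative values only, and does not normalise "../" or "./" segments or query-only values.

```php
<?php
// Hypothetical helper (not from the original post): resolve a Location header
// value against the URL of the response that carried it.
function resolveRedirect(string $base, string $location): string
{
    // Already absolute: scheme://...
    if (preg_match('@^[a-z][a-z0-9+.-]*://@i', $location)) {
        return $location;
    }

    $parts = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $authority = $parts['host'] ?? '';
    if (isset($parts['port'])) {
        $authority .= ':' . $parts['port'];
    }

    // Protocol-relative: //host/path
    if (strpos($location, '//') === 0) {
        return $scheme . ':' . $location;
    }
    // Relative to the server root: /path
    if (strpos($location, '/') === 0) {
        return $scheme . '://' . $authority . $location;
    }
    // Relative to the current directory: keep everything up to the last "/"
    $path = $parts['path'] ?? '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $scheme . '://' . $authority . $dir . $location;
}

echo resolveRedirect('http://en.wikipedia.org/', '/wiki/Main_Page'), "\n";
```

In the question's loop, the three preg_match branches could be replaced by a single call such as $url = resolveRedirect($url, $match[1]), assuming $match[1] holds the trimmed Location value.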
