使用 php-libcurl 跟踪页面标题和重定向
我正在编写一个脚本来跟踪标头,尤其是 URL 的重定向和 cookie。 很多时候,当我打开一个网址时,它会重定向到另一个网址,有时甚至会重定向到多个网址,并且还会存储一些 cookie。但是当我用 url 运行脚本时
我的脚本没有保存cookie,它只显示一个重定向并且没有存储任何cookies。但是当我在 Firefox 中浏览 url 时,它保存了 cookie,当我使用 Live HTTP Headers 检查它时,它显示了多个 get 请求。实时 HTTP 标头还显示存在 Set-Cookie 标头。
<?php
$url="http://en.wikipedia.org/";
$userAgent="Mozilla/5.0 (Windows NT 5.1; rv:2.0)Gecko/20100101 Firefox/4.0";
$accept="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$encoding="gzip, deflate";
$header['lang']="en-us,en;q=0.5";
$header['charset']="ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header['conn']="keep-alive";
$header['keep-alive']=115;
$i=1;
$flag=1; //0 if there is no redirect i.e. no location header to follow. used here to to control the while loop below
while($flag!=0) {
$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_USERAGENT,$userAgent);
curl_setopt($ch,CURLOPT_ENCODING,$encoding);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,0);
curl_setopt($ch,CURLOPT_HEADER,1);
curl_setopt($ch,CURLOPT_NOBODY,1);
curl_setopt($ch,CURLOPT_AUTOREFERER,true);
curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . "/cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . "/cookie.txt");
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$pageHeader[$i]=curl_exec($ch);
curl_close($ch);
$flag=preg_match('/Location: (.*)\s/',$pageHeader[$i],$location[$i]);
if($flag==1) { //if there is a location header
if(preg_match('@^(http://|www.)@',$location[$i][1],$tempurl)==1) { //if it is an absolute url
$url=$location[$i][1];
} else {
if(preg_match('@^/(.*)@',$location[$i][1],$tempurl)==1) { //if the url corresponds to url relative to server's root
preg_match('@^((http://)|(www.))[^/]+@',$url,$domain);
$url=$domain.$tempurl[0];
} else { //if the url is relative to current directory
$url=preg_replace('@(/[^/]+)$@',"/".$location[$i][1],$url);
}
}
$location[$i]=$url;
preg_match('/Set-Cookie: (.*)\s/',$pageHeader[$i],$cookie[$i]);
$i++;
}
foreach($location as $l)
$loc=$loc.$l."\n";
$header=implode("\n\n\n",$pageHeader);
file_put_contents(dirname(__FILE__) . "/location.txt",$loc);
file_put_contents(dirname(__FILE__) . "/header.txt",$header);
?>
此处创建了文件 location.txt
和 header.txt
,但未创建 cookie.txt
。 如果我将网址更改为 google.com,那么它会在 location.txt
文件中显示到 google.co.in
的重定向,并在 中保存一个 cookie cookie.txt
文件。但是当我在 Firefox
中打开 google.com
时,它会保存三个 cookie。有什么问题吗? 我认为页面上有一些 javascript 正在设置 cookie,所以curl 无法获取它。 也欢迎对上述代码的改进提出任何建议
I was writing a script to track headers especially redirects and cookies for a url.
Many times when i open a url it redirects to another url or sometimes more than one url and also stores some cookies. But when i ran the script with url
my script didnt save cookies and it only showed one redirect and didnt store any cookies. but when i browsed the url in firefox it saved cookies and when i inspected it with Live HTTP Headers
it showed multiple get requests. Live HTTP Headers also shows that there are Set-Cookie headers.
<?php
$url="http://en.wikipedia.org/";
$userAgent="Mozilla/5.0 (Windows NT 5.1; rv:2.0)Gecko/20100101 Firefox/4.0";
$accept="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$encoding="gzip, deflate";
$header['lang']="en-us,en;q=0.5";
$header['charset']="ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header['conn']="keep-alive";
$header['keep-alive']=115;
$i=1;
$flag=1; //0 if there is no redirect i.e. no location header to follow. used here to to control the while loop below
while($flag!=0) {
$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_USERAGENT,$userAgent);
curl_setopt($ch,CURLOPT_ENCODING,$encoding);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,0);
curl_setopt($ch,CURLOPT_HEADER,1);
curl_setopt($ch,CURLOPT_NOBODY,1);
curl_setopt($ch,CURLOPT_AUTOREFERER,true);
curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . "/cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . "/cookie.txt");
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$pageHeader[$i]=curl_exec($ch);
curl_close($ch);
$flag=preg_match('/Location: (.*)\s/',$pageHeader[$i],$location[$i]);
if($flag==1) { //if there is a location header
if(preg_match('@^(http://|www.)@',$location[$i][1],$tempurl)==1) { //if it is an absolute url
$url=$location[$i][1];
} else {
if(preg_match('@^/(.*)@',$location[$i][1],$tempurl)==1) { //if the url corresponds to url relative to server's root
preg_match('@^((http://)|(www.))[^/]+@',$url,$domain);
$url=$domain.$tempurl[0];
} else { //if the url is relative to current directory
$url=preg_replace('@(/[^/]+)$@',"/".$location[$i][1],$url);
}
}
$location[$i]=$url;
preg_match('/Set-Cookie: (.*)\s/',$pageHeader[$i],$cookie[$i]);
$i++;
}
foreach($location as $l)
$loc=$loc.$l."\n";
$header=implode("\n\n\n",$pageHeader);
file_put_contents(dirname(__FILE__) . "/location.txt",$loc);
file_put_contents(dirname(__FILE__) . "/header.txt",$header);
?>
here the file location.txt
and header.txt
are created but cookie.txt
are not created.
if i change the url to google.com then it shows the redirect to google.co.in
in the location.txt
file and it saves a cookie in the cookie.txt
file. But when i open google.com
in Firefox
it saves three cookies. What can be wrong?
I think there is some javascript on the page that is setting the cookies so curl is not able to get that.
also any suggestions for the improvement of above code are welcome
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
您的位置:以下代码完全损坏,因为您应该已经看到大多数相对的 HTTP 重定向,因此您不能仅在后续请求中使用该字符串作为 URL。
Your Location: following code is completely broken, as you should've seen most HTTP redirects relative and thus you can't just use that string as a URL in the subsequent request.