file_get_contents script works with some websites but not others

Posted on 2024-11-27 06:40:34

I'm looking to build a PHP script that parses HTML for particular tags. I've been using this code block, adapted from this tutorial:

<?php 
$data = file_get_contents('http://www.google.com');
$regex = '/<title>(.+?)</';
preg_match($regex,$data,$match);
var_dump($match); 
echo $match[1];
?>

The script works with some websites (like google, above), but when I try it with other websites (like, say, freshdirect), I get this error:

"Warning: file_get_contents(http://www.freshdirect.com) [function.file-get-contents]: failed to open stream: HTTP request failed!"

I've seen a bunch of great suggestions on StackOverflow, for example to enable extension=php_openssl.dll in php.ini. But (1) my version of php.ini didn't have extension=php_openssl.dll in it, and (2) when I added it to the extensions section and restarted the WAMP server, per this thread, still no success.

Would someone mind pointing me in the right direction? Thank you very much!

著墨染雨君画夕 2024-12-04 06:40:34

It just requires a User-Agent header; any string suffices, even "any":

file_get_contents("http://www.freshdirect.com",false,stream_context_create(
    array("http" => array("user_agent" => "any"))
));

See the HTTP context options in the PHP manual for more options.

Of course, you can set user_agent in your ini:

 ini_set("user_agent","any");
 echo file_get_contents("http://www.freshdirect.com");

... but I prefer to be explicit for the next programmer working on it.
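
To see why a particular site rejects the request, it can help to let the call complete and look at the status line rather than only the warning. The following is a minimal sketch, not part of the original answer: the helper names `status_code_from_headers` and `fetch_with_user_agent` are hypothetical, and the `ignore_errors` and `timeout` context options are additions chosen for illustration.

```php
<?php
// Hypothetical helper: extract the numeric status code from the header
// lines PHP collects for an HTTP request (e.g. "HTTP/1.1 403 Forbidden").
function status_code_from_headers(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // After redirects there can be several status lines; keep the last one.
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: fetch a URL with an explicit User-Agent.
// Returns array(body or false, status code or null).
function fetch_with_user_agent(string $url, string $userAgent = 'any'): array
{
    $context = stream_context_create(array(
        'http' => array(
            'user_agent'    => $userAgent,
            'ignore_errors' => true, // return the body even on 4xx/5xx responses
            'timeout'       => 10,   // seconds; an arbitrary choice for this sketch
        ),
    ));
    $body = @file_get_contents($url, false, $context);
    // The HTTP wrapper fills $http_response_header in the local scope.
    $headers = isset($http_response_header) ? $http_response_header : array();
    return array($body, status_code_from_headers($headers));
}
```

Calling `list($body, $status) = fetch_with_user_agent('http://www.freshdirect.com');` would then let you compare the status with and without a User-Agent string.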

ま柒月 2024-12-04 06:40:34

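// file_get_html() comes from the PHP Simple HTML DOM Parser library (simple_html_dom.php)
$html = file_get_html('http://google.com/');
// find() returns an array unless an index is given, so ask for the first match
$title = $html->find('title', 0)->innertext;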

Or, if you prefer preg_match (and you really should be using cURL instead of file_get_contents):

function curl($url){

    $headers[]  = "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13";
    $headers[]  = "Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
    $headers[]  = "Accept-Language:en-us,en;q=0.5";
    $headers[]  = "Accept-Encoding:gzip,deflate";
    $headers[]  = "Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $headers[]  = "Keep-Alive:115";
    $headers[]  = "Connection:keep-alive";
    $headers[]  = "Cache-Control:max-age=0";

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($curl, CURLOPT_ENCODING, "gzip");
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    $data = curl_exec($curl);
    curl_close($curl);
    return $data;

}


$data = curl('http://www.google.com');
$regex = '#<title>(.*?)</title>#mis';
preg_match($regex,$data,$match);
var_dump($match); 
echo $match[1];
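
The pattern in this answer differs from the one in the question in ways that matter on real pages: the `s` flag lets `.` match newlines, and the `i` flag accepts `<TITLE>`. A small offline sketch with a made-up HTML string and a hypothetical helper name, `extract_title`, illustrates the difference. Allowing attributes on the tag and decoding entities are extensions beyond the original pattern.

```php
<?php
// Hypothetical helper: return the trimmed <title> text, or null if absent.
function extract_title(string $html): ?string
{
    // i: case-insensitive tag name; s: "." also matches newlines.
    // \b[^>]* allows attributes on the opening tag.
    if (preg_match('#<title\b[^>]*>(.*?)</title>#is', $html, $match)) {
        return trim(html_entity_decode($match[1], ENT_QUOTES, 'UTF-8'));
    }
    return null;
}

// Made-up sample: the title spans several lines and uses upper-case tags.
$sample = "<html><head><TITLE>\n  Fresh &amp; Direct\n</TITLE></head></html>";

var_dump(extract_title($sample));                     // the helper finds the title
var_dump(preg_match('/<title>(.+?)</', $sample, $m)); // the question's pattern on the same input
```

On this sample the helper returns the title text, while the case-sensitive pattern from the question finds no match.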

小情绪 2024-12-04 06:40:34

Another option: some hosts disable CURLOPT_FOLLOWLOCATION, so following redirects recursively is what you want. This version also logs any errors to a text file. It also includes a simple example of how to use DOMDocument() to extract the content; it's obviously not extensive, but it's something you could build upon.

<?php 
function file_get_site($url){
(function_exists('curl_init')) ? '' : die('cURL Must be installed. Ask your host to enable it or uncomment extension=php_curl.dll in php.ini');
$curl = curl_init();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: ";

curl_setopt($curl, CURLOPT_URL, $url);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0 Firefox/5.0');
curl_setopt($curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($curl, CURLOPT_HEADER, true);
curl_setopt($curl, CURLOPT_REFERER, $url);
curl_setopt($curl, CURLOPT_ENCODING, 'gzip,deflate');
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_TIMEOUT, 60);

$html = curl_exec($curl);

$status = curl_getinfo($curl);
curl_close($curl);

if($status['http_code']!=200){
    if($status['http_code'] == 301 || $status['http_code'] == 302) {
        list($header) = explode("\r\n\r\n", $html, 2);
        $matches = array();
        preg_match("/(Location:|URI:)[^(\n)]*/", $header, $matches);
        $url = trim(str_replace($matches[1],"",$matches[0]));
        $url_parsed = parse_url($url);
        // parse_url() returns false for a malformed URL, which isset() would not catch
        return ($url_parsed !== false && $url !== '')? file_get_site($url):'';
    }
    $oline='';
    foreach($status as $key=>$eline){$oline.='['.$key.']'.$eline.' ';}
    $line =$oline." \r\n ".$url."\r\n-----------------\r\n";
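    $handle = @fopen('./curl.error.log', 'a');
    // Only write if the log file could be opened, and close it afterwards
    if($handle){
        fwrite($handle, $line);
        fclose($handle);
    }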
    return FALSE;
}
return $html;
}


function get_content_tags($source,$tag,$id=null,$value=null){
    $xml = new DOMDocument();
    @$xml->loadHTML($source);

    foreach($xml->getElementsByTagName($tag) as $tags) {
        if($id!=null){
            if($tags->getAttribute($id)==$value){
                return $tags->getAttribute('content');
            }
            // Attribute filter requested but not matched: try the next tag
            continue;
        }
        return $tags->nodeValue;
    }
}


$source = file_get_site('http://www.freshdirect.com/about/index.jsp');

echo get_content_tags($source,'title'); //FreshDirect

echo get_content_tags($source,'meta','name','description'); //Online grocer providing high quality fresh......

?>
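
One thing the recursive approach above leaves open is a cap on the number of redirects, so two pages that redirect to each other would recurse indefinitely. The sketch below shows one way to bound it. It is an illustration under stated assumptions, not the answer's code: `location_from_headers` and `follow_redirects` are hypothetical names, the fetcher is passed in as a callable so the logic can be tried without network access, and relative `Location` values are not resolved.

```php
<?php
// Hypothetical helper: pull the redirect target out of a raw header block,
// such as the one cURL returns when CURLOPT_HEADER is enabled.
function location_from_headers(string $rawHeaders): ?string
{
    // Header names are case-insensitive; the m flag anchors ^ and $ per line.
    if (preg_match('/^Location:\s*(.*?)\s*$/mi', $rawHeaders, $m) && $m[1] !== '') {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: follow at most $maxHops redirects.
// $fetch is any callable returning array(status code, raw headers, body).
function follow_redirects(callable $fetch, string $url, int $maxHops = 5): ?string
{
    for ($hop = 0; $hop <= $maxHops; $hop++) {
        list($status, $headers, $body) = $fetch($url);
        if ($status !== 301 && $status !== 302) {
            return $status === 200 ? $body : null;
        }
        $next = location_from_headers($headers);
        if ($next === null) {
            return null; // redirect without a usable Location header
        }
        $url = $next; // assumes an absolute URL; relative targets are not resolved
    }
    return null; // too many redirects
}

// Usage with a fake fetcher and made-up URLs (no network involved).
$pages = array(
    'http://a.test/' => array(301, "HTTP/1.1 301 Moved\r\nLocation: http://b.test/\r\n", ''),
    'http://b.test/' => array(200, "HTTP/1.1 200 OK\r\n", 'hello'),
);
$fetch = function ($url) use ($pages) { return $pages[$url]; };
var_dump(follow_redirects($fetch, 'http://a.test/'));
```

In real use the callable would wrap the cURL calls from `file_get_site()` and return the status code, the header block, and the body separately.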
