PHP parsing with simple_html_dom, please check

Posted 2024-10-15 18:14:56


I made a simple parser for saving all the images per page with Simple HTML DOM and the GetImage class, but I had to nest a loop inside a loop in order to go page by page, and I think something in my code is just not optimized, because it is very slow and always times out or exceeds the memory limit. Could someone have a quick look at the code and maybe spot something really stupid that I did?

Here is the code without libraries included...

$pageNumbers = array(); //Array to hold number of pages to parse

$url = 'http://sitename/category/'; //target url
$html = file_get_html($url);


// Detect the paginator class and push each page number into an array to find out how many pages to parse
foreach($html->find('td.nav .str') as $pn){
    array_push($pageNumbers, $pn->innertext);               
}

// initializing the get image class
$image = new GetImage;
$image->save_to = $pfolder.'/'; // save to folder, value from post request.

//Start reading pages array and parsing all images per page.
foreach($pageNumbers as $ppp){

    $target_url = 'http://sitename.com/category/'.$ppp; //Here i construct a page from an array to parse.
    $target_html = file_get_html($target_url); //Reading the page html to find all images inside next.

    //Final loop to find and save each image per page.
    foreach($target_html->find('img.clipart') as $element) {
        $image->source = url_to_absolute($target_url, $element->src);
        $get = $image->download('curl'); // using GD
        echo 'saved'.url_to_absolute($target_url, $element->src).'<br />';           
    }

}

Thank you.


Comments (2)

不再让梦枯萎 2024-10-22 18:14:56


You are doing quite a lot here; I'm not surprised the script times out. You download multiple web pages, parse them, find images in them, and then download those images... how many pages, and how many images per page? Unless we're talking very small numbers, this is to be expected.

Given that, I'm not sure what your question really is, but I'm assuming it's "how do I make this work?". You have a few options; it really depends what this is for. If it's a one-off hack to scrape some sites, ramp up the memory and time limits, maybe chunk up the work so each run only does a little, and next time write it in something more suitable ;)
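
For the one-off route, a minimal sketch of raising the limits and chunking the work might look like this (the 512M value, the $chunkSize, and the ?offset request parameter are illustrative assumptions, not part of the original code):

// Raise the limits for this request only (values are assumptions, adjust to your host).
ini_set('memory_limit', '512M');
set_time_limit(0); // no execution time limit for this run

// Process only a slice of the pages per request, e.g. ?offset=0, then ?offset=10, ...
$chunkSize = 10; // hypothetical number of pages handled per run
$offset    = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;

foreach (array_slice($pageNumbers, $offset, $chunkSize) as $ppp) {
    // ... same per-page parsing and image download as in the question ...
}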

If this is something that happens server-side, it should probably happen asynchronously to the user interaction - i.e. rather than the user requesting some page which has to do all of this before returning, it should happen in the background. It wouldn't even have to be PHP; you could have a script running in any language that gets passed things to scrape and does them.
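
As a rough sketch of the background approach in PHP (scrape.php is a hypothetical CLI worker that runs the same parsing loop; the exec() call assumes a Unix-like host where a trailing & detaches the process):

// Queue the work and return to the user immediately.
$category = escapeshellarg('http://sitename.com/category/');
exec('php scrape.php ' . $category . ' > /dev/null 2>&1 &');
echo 'Scrape queued; images will be saved in the background.';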

┾廆蒐ゝ 2024-10-22 18:14:56


I suggest making a function to do the actual simple html dom processing.
I usually use the following 'template'... note the 'clear memory' section.
Apparently there is a memory leak in PHP 5... at least I read that someplace.

function scraping_page($iUrl)
{
    // create HTML DOM
    $html = file_get_html($iUrl);

    // get image elements
    $aObj = $html->find('img');

    // do something with the element objects

    // clean up memory (prevent memory leaks in PHP 5)
    $html->clear();  // **** very important ****
    unset($html);    // **** very important ****

    return;  // also can return something: array, string, whatever
}
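
Applied to the loop from the question, a minimal sketch would free each page's DOM before moving on to the next one (only the clear()/unset() calls are new; everything else is taken from the original code):

foreach ($pageNumbers as $ppp) {
    $target_url  = 'http://sitename.com/category/' . $ppp;
    $target_html = file_get_html($target_url);

    foreach ($target_html->find('img.clipart') as $element) {
        $image->source = url_to_absolute($target_url, $element->src);
        $image->download('curl');
    }

    // clean up memory before parsing the next page
    $target_html->clear();
    unset($target_html);
}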

Hope that helps.
