PHP parsing with simple_html_dom, please check

Posted 2024-10-15 18:14:56


I made a simple parser for saving all the images per page with Simple HTML DOM and the GetImage class, but I had to nest a loop inside a loop in order to go page by page, and I think something in my code is just not optimized, because it is very slow and always times out or exceeds the memory limit. Could someone have a quick look at the code and maybe spot something really stupid that I did?

Here is the code without libraries included...

$pageNumbers = array(); //Array to hold number of pages to parse

$url = 'http://sitename/category/'; //target url
$html = file_get_html($url);


// Detect the paginator class and push each page number into an array to find out how many pages to parse
foreach($html->find('td.nav .str') as $pn){
    array_push($pageNumbers, $pn->innertext);               
}

// initializing the get image class
$image = new GetImage;
$image->save_to = $pfolder.'/'; // save to folder, value from post request.

//Start reading pages array and parsing all images per page.
foreach($pageNumbers as $ppp){

    $target_url = 'http://sitename.com/category/'.$ppp; //Here i construct a page from an array to parse.
    $target_html = file_get_html($target_url); //Reading the page html to find all images inside next.

    //Final loop to find and save each image per page.
    foreach($target_html->find('img.clipart') as $element) {
        $image->source = url_to_absolute($target_url, $element->src);
        $get = $image->download('curl'); // using GD
        echo 'saved'.url_to_absolute($target_url, $element->src).'<br />';           
    }

}

Thank you.


Comments (2)

不再让梦枯萎 2024-10-22 18:14:56


You are doing quite a lot here; I'm not surprised the script times out. You download multiple web pages, parse them, find images in them, and then download those images... how many pages, and how many images per page? Unless we're talking very small numbers, this is to be expected.

Given that, I'm not sure what your question really is, but I'm assuming it's "how do I make this work?". You have a few options; it really depends what this is for. If it's a one-off hack to scrape some sites, ramp up the memory and time limits, maybe chunk up the work so each run only does a little, and next time write it in something more suitable ;)
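
For the one-off route, a minimal sketch of raising the limits and chunking the work might look like this (the 512M value, the $chunkSize, and the ?offset request parameter are illustrative assumptions, not part of the original code):

// Raise the limits for this request only (values are assumptions, adjust to your host).
ini_set('memory_limit', '512M');
set_time_limit(0); // no execution time limit for this run

// Process only a slice of the pages per request, e.g. ?offset=0, then ?offset=10, ...
$chunkSize = 10; // hypothetical number of pages handled per run
$offset    = isset($_GET['offset']) ? (int) $_GET['offset'] : 0;

foreach (array_slice($pageNumbers, $offset, $chunkSize) as $ppp) {
    // ... same per-page parsing and image download as in the question ...
}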

If this is something that happens server-side, it should probably happen asynchronously to the user interaction - i.e. rather than the user requesting some page which has to do all of this before returning, it should happen in the background. It wouldn't even have to be PHP; you could have a script running in any language that gets passed things to scrape and does them.
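
As a rough sketch of the background approach in PHP (scrape.php is a hypothetical CLI worker that runs the same parsing loop; the exec() call assumes a Unix-like host where a trailing & detaches the process):

// Queue the work and return to the user immediately.
$category = escapeshellarg('http://sitename.com/category/');
exec('php scrape.php ' . $category . ' > /dev/null 2>&1 &');
echo 'Scrape queued; images will be saved in the background.';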

┾廆蒐ゝ 2024-10-22 18:14:56


I suggest making a function to do the actual simple html dom processing.
I usually use the following 'template'... note the 'clear memory' section.
Apparently there is a memory leak in PHP 5... at least I read that someplace.

function scraping_page($iUrl)
{
    // create HTML DOM
    $html = file_get_html($iUrl);

    // get image elements
    $aObj = $html->find('img');

    // do something with the element objects

    // clean up memory (prevent memory leaks in PHP 5)
    $html->clear();  // **** very important ****
    unset($html);    // **** very important ****

    return;  // also can return something: array, string, whatever
}
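
Applied to the loop from the question, a minimal sketch would free each page's DOM before moving on to the next one (only the clear()/unset() calls are new; everything else is taken from the original code):

foreach ($pageNumbers as $ppp) {
    $target_url  = 'http://sitename.com/category/' . $ppp;
    $target_html = file_get_html($target_url);

    foreach ($target_html->find('img.clipart') as $element) {
        $image->source = url_to_absolute($target_url, $element->src);
        $image->download('curl');
    }

    // clean up memory before parsing the next page
    $target_html->clear();
    unset($target_html);
}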

Hope that helps.
