Scraping data with regex and simplehtmldom

Posted 2024-11-29 07:42:46

I am trying to scrape some data from this site: http://laperuanavegana.wordpress.com/. Specifically, I want the title and the ingredients of each recipe. The ingredients are located between two specific keywords. I am trying to extract them using regex and simplehtmldom, but the output is the full HTML text, not just the ingredients. Here is my code:
<?php

include_once('simple_html_dom.php');
$base_url = "http://laperuanavegana.wordpress.com/";

traverse($base_url);


function traverse($base_url)
{
    
    $html = file_get_html($base_url);
    $k1="Ingredientes";
    $k2="Preparación";
    preg_match_all("/$k1(.*)$k2/s",$html->innertext,$out);
    echo $out[0][0];
}

?>

There are multiple ingredient lists on this page and I want all of them, so I am using preg_match_all(). It would be helpful if anybody could spot the bug in this code. Thanks in advance.

Comments (2)

南城旧梦 2024-12-06 07:42:46

When you are already using an HTML parser (even a poor one like SimpleHtmlDom), why are you trying to mess up things with Regex then? That's like using a scalpel to open up the patient and then falling back to a sharpened spoon for the actual surgery.

Since I strongly believe no one should use SimpleHtmlDom because it has a poor codebase and is much slower than libxml based parsers, here is how to do it with PHP's native DOM extension and XPath. XPath is effectively the Regex or SQL for X(HT)ML documents. Learn it, so you will never ever have to touch Regex for HTML again.

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://laperuanavegana.wordpress.com/2011/06/11/ensalada-tibia-de-quinua-mango-y-tomate/');
libxml_clear_errors();

$recipe = array();
$xpath = new DOMXPath($dom);
$contentDiv = $dom->getElementById('content');
$recipe['title'] = $xpath->evaluate('string(div/h2/a)', $contentDiv);
foreach ($xpath->query('div/div/ul/li', $contentDiv) as $listNode) {
    $recipe['ingredients'][] = $listNode->nodeValue;
}
print_r($recipe);

This will output:

Array
(
    [title] => Ensalada tibia de quinua, mango y tomate
    [ingredients] => Array
        (
            [0] => 250gr de quinua cocida tibia
            [1] => 1 mango grande
            [2] => 2 tomates
            [3] => Unas hojas de perejil
            [4] => Sal
            [5] => Aceite de oliva
            [6] => Vinagre balsámico
        )

)

Note that we are not parsing http://laperuanavegana.wordpress.com/ but the actual blog post. The main URL will change content whenever the blog owner adds a new post.
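Because the markup of a live blog can change at any time, it may be worth guarding against a missing container. The following is only a sketch: `extractRecipe()` is a hypothetical helper, and the inline sample markup is an assumption that mirrors the structure the XPath expressions above expect, not the real page.

```php
<?php
// Hypothetical helper: returns null when the expected container is missing.
function extractRecipe(string $html): ?array
{
    $dom = new DOMDocument;
    libxml_use_internal_errors(true);
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    if (!$loaded) {
        return null;
    }

    $contentDiv = $dom->getElementById('content');
    if ($contentDiv === null) {
        return null; // layout changed, or this is not a recipe page
    }

    $xpath = new DOMXPath($dom);
    $recipe = array(
        'title'       => trim($xpath->evaluate('string(div/h2/a)', $contentDiv)),
        'ingredients' => array(),
    );
    foreach ($xpath->query('div/div/ul/li', $contentDiv) as $li) {
        $recipe['ingredients'][] = trim($li->nodeValue);
    }
    return $recipe;
}

// Inline sample markup (an assumption, not the real page).
$sample = '<html><body><div id="content"><div><h2><a href="#">Flan</a></h2>'
        . '<div><ul><li>Sal</li></ul></div></div></div></body></html>';

var_dump(extractRecipe($sample) !== null);
var_dump(extractRecipe('<html><body><p>no recipe here</p></body></html>'));
```

For a real page you would pass in the fetched HTML instead of `$sample` and decide how to handle a `null` result.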

To get all the Recipes from the main page, you can use

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTMLFile('https://laperuanavegana.wordpress.com');
libxml_clear_errors();
$contentDiv = $dom->getElementById('content');
$xp = new DOMXPath($dom);
foreach ($xp->query('div/h2/a|div/div/ul/li', $contentDiv) as $node) {
    echo
        ($node->nodeName === 'a') ? "\n# " : '- ',
        $node->nodeValue,
        PHP_EOL;
}

This will output

# Ensalada tibia de quinua, mango y tomate
- 250gr de quinua cocida tibia
- 1 mango grande
- 2 tomates
- Unas hojas de perejil
- Sal
- Aceite de oliva
- Vinagre balsámico

# Flan de lúcuma
- 1 lúcuma grandota o 3 pequeñas
- 1/2 litro de leche de soja evaporada
…

and so on
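If you want the data in an array rather than echoed, the same union query can be grouped by title. This is a minimal sketch: the inline HTML is a hypothetical stand-in for the main page, and it relies on the query returning nodes in document order, as the output above suggests.

```php
<?php
// Hypothetical markup mirroring the structure the XPath above expects:
// titles at div/h2/a and ingredients at div/div/ul/li below #content.
$html = <<<HTML
<html><body><div id="content">
  <div>
    <h2><a href="#">Ensalada tibia</a></h2>
    <div><ul><li>1 mango grande</li><li>2 tomates</li></ul></div>
  </div>
  <div>
    <h2><a href="#">Flan</a></h2>
    <div><ul><li>1/2 litro de leche</li></ul></div>
  </div>
</div></body></html>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);
$contentDiv = $dom->getElementById('content');

$recipes = array();
$current = null;
// Each title node starts a new group; the list items that follow
// it are collected as that recipe's ingredients.
foreach ($xp->query('div/h2/a|div/div/ul/li', $contentDiv) as $node) {
    if ($node->nodeName === 'a') {
        $current = trim($node->nodeValue);
        $recipes[$current] = array();
    } elseif ($current !== null) {
        $recipes[$current][] = trim($node->nodeValue);
    }
}
print_r($recipes);
```

Keying by title is a design choice for illustration only. If two posts could share a title, a list of `array('title' => ..., 'ingredients' => ...)` entries would be safer.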

謌踐踏愛綪 2024-12-06 07:42:46

You need to add a question mark there. It makes the quantifier non-greedy (lazy); otherwise the pattern matches everything from the first $k1 to the last $k2 on the page. With the question mark, each match stops at the next $k2.

preg_match_all("/$k1(.*?)$k2/s",$html->innertext,$out);
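The difference can be seen without touching the network. The sketch below uses a hypothetical sample string in place of `$html->innertext`. The `preg_quote()` calls and the `u` modifier are additions of mine, not part of the answer: they are there in case the keywords contain regex metacharacters or multibyte characters.

```php
<?php
// Hypothetical sample text standing in for $html->innertext.
$text = "Ingredientes: quinua, mango Preparación: mezclar. "
      . "Ingredientes: lúcuma, leche Preparación: batir.";

$k1 = preg_quote("Ingredientes", "/");
$k2 = preg_quote("Preparación", "/");

// Greedy: a single match from the first $k1 to the last $k2.
preg_match_all("/$k1(.*)$k2/su", $text, $greedy);

// Lazy: one match per $k1 ... $k2 pair.
preg_match_all("/$k1(.*?)$k2/su", $text, $lazy);

echo count($greedy[1]), PHP_EOL; // 1
echo count($lazy[1]), PHP_EOL;   // 2
print_r(array_map('trim', $lazy[1]));
```

Note that `$out[0]` holds the full matches including the keywords, while `$out[1]` holds only the captured text between them, which is probably what you want to print.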