How do you parse and process HTML/XML in PHP?

Posted on 2024-09-18 00:33:15

How can one parse HTML/XML and extract information from it?

Comments (30)

十雾 2024-09-25 00:33:16

One general approach I haven't seen mentioned here is to run HTML through Tidy, which can be set to spit out guaranteed-valid XHTML. Then you can use any old XML library on it.

But to your specific problem, you should take a look at this project: http://fivefilters.org/content-only/ -- it's a modified version of the Readability algorithm, which is designed to extract just the textual content (not headers and footers) from a page.

画▽骨i 2024-09-25 00:33:16

For 1a and 2: I would vote for the new Symfony Component class DomCrawler.
This class allows queries similar to CSS selectors. Take a look at this presentation for real-world examples: news-of-the-symfony2-world.

The component is designed to work standalone and can be used without Symfony.

The only drawback is that it will only work with PHP 5.3 or newer.

平安喜乐 2024-09-25 00:33:16

This is commonly referred to as screen scraping, by the way. The library I have used for this is Simple HTML Dom Parser.

跨年 2024-09-25 00:33:16

We have built quite a few crawlers for our own needs. At the end of the day, simple regular expressions usually do the job best. The libraries listed above are good at what they were created for, but if you know exactly what you are looking for, regular expressions are the safer way to go, because they also cope with invalid HTML/XHTML structures that would fail to load in most parsers.

有深☉意 2024-09-25 00:33:16

I recommend PHP Simple HTML DOM Parser.

It really has nice features, like:

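// Assumes simple_html_dom.php is already included; the URL is a placeholder.
$html = file_get_html('http://www.example.com/');
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

原野 2024-09-25 00:33:16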

This sounds like a good task description of W3C XPath technology. It's easy to express queries like "return all href attributes in img tags that are nested in <foo><bar><baz> elements." Not being a PHP buff, I can't tell you in what form XPath may be available. If you can call an external program to process the HTML file you should be able to use a command line version of XPath.
For a quick intro, see http://en.wikipedia.org/wiki/XPath.
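
In PHP, XPath is available through the DOMXPath class of the bundled DOM extension. A minimal sketch of such a query follows; the sample markup is invented for illustration, div/ul/li stand in for the <foo><bar><baz> elements above, and the query reads src because that is the attribute img elements carry:

```php
<?php
// Invented sample markup: one image nested in div > ul > li, one elsewhere.
$html = '<html><body>'
      . '<div><ul><li><img src="a.png" alt="A"></li></ul></div>'
      . '<p><img src="b.png" alt="B"></p>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// All src attributes of <img> elements nested anywhere inside div/ul/li.
$sources = [];
foreach ($xpath->query('//div/ul/li//img/@src') as $attr) {
    $sources[] = $attr->value;
}

print_r($sources);
```

Only the first image matches the path, so the second one is left out of $sources.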

感性 2024-09-25 00:33:16

Third party alternatives to SimpleHtmlDom that use DOM instead of String Parsing: phpQuery, Zend_Dom, QueryPath and FluentDom.

妄断弥空 2024-09-25 00:33:16

Yes, you can use simple_html_dom for this purpose. However, I have worked with simple_html_dom quite a lot, particularly for web scraping, and have found it too fragile. It does the basic job, but I would not recommend it.

I have never used curl for this purpose, but from what I have learned, curl can do the job much more efficiently and is much more solid.

Kindly check out this link: scraping-websites-with-curl

寄人书 2024-09-25 00:33:16

Advanced Html Dom is a simple HTML DOM replacement that offers the same interface, but it's DOM-based which means none of the associated memory issues occur.

It also has full CSS support, including jQuery extensions.

泼猴你往哪里跑 2024-09-25 00:33:16

QueryPath is good, but be careful of "tracking state": if you don't realise what it means, you can waste a lot of debugging time trying to find out what happened and why the code doesn't work.

It means that each call on the result set modifies the result set stored in the object. It is not chainable the way jQuery is, where each link in the chain is a new set. You have a single set, which holds the results of your query, and each function call modifies that single set.

To get jQuery-like behaviour, you need to branch before you do a filter/modify-like operation; that mirrors what happens in jQuery much more closely.

$results = qp("div p");
$forename = $results->find("input[name='forename']");

$results now contains the result set for input[name='forename'], NOT the original query "div p". This tripped me up a lot. What I found was that QueryPath tracks the filters, the finds and everything else that modifies your results, and stores them in the object. You need to do this instead:

$forename = $results->branch()->find("input[name='forename']");

Then $results won't be modified and you can reuse the result set again and again. Perhaps somebody with much more knowledge can clear this up a bit, but this is basically how it works from what I've found.

毁虫ゝ 2024-09-25 00:33:16

For HTML5, html5 lib has been abandoned for years now. The only HTML5 library I can find with a recent update and maintenance records is html5-php which was just brought to beta 1.0 a little over a week ago.

顾忌 2024-09-25 00:33:16

I created a library named PHPPowertools/DOM-Query, which allows you to crawl HTML5 and XML documents just like you do with jQuery.

Under the hood, it uses symfony/DomCrawler for conversion of CSS selectors to XPath selectors. It always uses the same DomDocument, even when passing one object to another, to ensure decent performance.


Example use :

namespace PowerTools;

// Get file content
$htmlcode = file_get_contents('https://github.com');

// Define your DOMCrawler based on file string
$H = new DOM_Query($htmlcode);

// Define your DOMCrawler based on an existing DOM_Query instance
$H = new DOM_Query($H->select('body'));

// Passing a string (CSS selector)
$s = $H->select('div.foo');

// Passing an element object (DOM Element)
$s = $H->select($documentBody);

// Passing a DOM Query object
$s = $H->select( $H->select('p + p'));

// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Nest your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function( $i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function( $i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

[...]

Supported methods :


  1. Renamed 'select', for obvious reasons
  2. Renamed 'void', since 'empty' is a reserved word in PHP

NOTE :

The library also includes its own zero-configuration autoloader for PSR-0 compatible libraries. The example included should work out of the box without any additional configuration. Alternatively, you can use it with composer.

独孤求败 2024-09-25 00:33:16

You could try using something like HTML Tidy to clean up any "broken" HTML and convert the HTML to XHTML, which you can then parse with an XML parser.
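
A rough sketch of that two-step approach, assuming the optional tidy extension (ext-tidy) is installed. The helper name htmlToXhtml and the sample markup are made up for illustration; the Tidy options shown are one plausible configuration, not the only one:

```php
<?php
// Hypothetical helper: repair HTML with Tidy and return it as XHTML.
function htmlToXhtml(string $html): string
{
    $config = [
        'output-xhtml'     => true,  // emit well-formed XHTML
        'numeric-entities' => true,  // write entities numerically so an XML parser needs no DTD
        'wrap'             => 0,     // do not hard-wrap long lines
    ];
    return (string) tidy_repair_string($html, $config, 'utf8');
}

if (extension_loaded('tidy')) {
    // Unclosed <b> and <p> tags: an XML parser would reject this as-is.
    $broken = '<p>Hello <b>world</p><p>Second paragraph';

    $xml = simplexml_load_string(htmlToXhtml($broken));

    // XHTML is the document's default namespace, so plain property access works.
    foreach ($xml->body->p as $p) {
        echo trim(strip_tags($p->asXML())), PHP_EOL;
    }
}
```

Any XML library could take the place of SimpleXML in the second step; it is used here only because it ships with PHP.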

魂牵梦绕锁你心扉 2024-09-25 00:33:16

I have written a general-purpose XML parser that can easily handle gigabyte-sized files. It's based on XMLReader and it's very easy to use:

$source = new XmlExtractor("path/to/tag", "/path/to/file.xml");
foreach ($source as $tag) {
    echo $tag->field1;
    echo $tag->field2->subfield1;
}

Here's the github repo: XmlExtractor
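
XmlExtractor's internals are not shown here, so the following is only a sketch of the general XMLReader streaming technique such a library can build on, not its actual implementation. The function name streamElements and the sample file are hypothetical:

```php
<?php
// Hypothetical helper: yields each <$tagName> element as a SimpleXMLElement,
// so only one such element is held in memory at a time.
// Limitation of this sketch: it only walks siblings of the first match.
function streamElements(string $file, string $tagName): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($file)) {
        throw new RuntimeException("Cannot open $file");
    }

    // Move the cursor forward to the first matching element.
    while ($reader->read() && $reader->name !== $tagName) {
        // keep reading
    }

    while ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $tagName) {
        yield new SimpleXMLElement($reader->readOuterXml());
        // Jump to the next sibling with this name, skipping the subtree just read.
        $reader->next($tagName);
    }

    $reader->close();
}

// Usage with a small temporary file standing in for a large document.
$file = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents(
    $file,
    '<catalog><item><name>A</name></item><item><name>B</name></item></catalog>'
);

$names = [];
foreach (streamElements($file, 'item') as $item) {
    $names[] = (string) $item->name;
}
unlink($file);

print_r($names);
```

Because the generator hands out one element at a time, memory use depends on the size of a single record rather than on the size of the file.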

冰火雁神 2024-09-25 00:33:16

Another option you can try is QueryPath. It's inspired by jQuery, but runs on the server in PHP, and it is used in Drupal.

故事↓在人 2024-09-25 00:33:16

XML_HTMLSax is rather stable, even if it's not maintained any more. Another option could be to pipe your HTML through Html Tidy and then parse it with standard XML tools.

夏日浅笑〃 2024-09-25 00:33:16

There are many ways to process HTML/XML DOM of which most have already been mentioned. Hence, I won't make any attempt to list those myself.

I merely want to add that I personally prefer using the DOM extension, and why:

  • it makes optimal use of the performance advantage of the underlying C code
  • it's OO PHP (and allows me to subclass it)
  • it's rather low level (which allows me to use it as a non-bloated foundation for more advanced behavior)
  • it provides access to every part of the DOM (unlike e.g. SimpleXml, which ignores some of the lesser-known XML features)
  • it has a syntax for DOM crawling that's similar to the syntax used in native JavaScript.

And while I miss the ability to use CSS selectors for DOMDocument, there is a rather simple and convenient way to add this feature: subclassing the DOMDocument and adding JS-like querySelectorAll and querySelector methods to your subclass.

For parsing the selectors, I recommend using the very minimalistic CssSelector component from the Symfony framework. This component just translates CSS selectors to XPath selectors, which can then be fed into a DOMXpath to retrieve the corresponding Nodelist.

You can then use this (still very low level) subclass as a foundation for more high level classes, intended to eg. parse very specific types of XML or add more jQuery-like behavior.

The code below comes straight out of my DOM-Query library and uses the technique I described.

For HTML parsing :

namespace PowerTools;

use \Symfony\Component\CssSelector\CssSelector as CssSelector;

class DOM_Document extends \DOMDocument {
    public function __construct($data = false, $doctype = 'html', $encoding = 'UTF-8', $version = '1.0') {
        parent::__construct($version, $encoding);
        if ($doctype && $doctype === 'html') {
            @$this->loadHTML($data);
        } else {
            @$this->loadXML($data);
        }
    }

    public function querySelectorAll($selector, $contextnode = null) {
        if (isset($this->doctype->name) && $this->doctype->name == 'html') {
            CssSelector::enableHtmlExtension();
        } else {
            CssSelector::disableHtmlExtension();
        }
        $xpath = new \DOMXpath($this);
        return $xpath->query(CssSelector::toXPath($selector, 'descendant::'), $contextnode);
    }

    [...]

    public function loadHTMLFile($filename, $options = 0) {
        $this->loadHTML(file_get_contents($filename), $options);
    }

    public function loadHTML($source, $options = 0) {
        if ($source && $source != '') {
            $data = trim($source);
            $html5 = new HTML5(array('targetDocument' => $this, 'disableHtmlNsInDom' => true));
            $data_start = mb_substr($data, 0, 10);
            if (strpos($data_start, '<!DOCTYPE ') === 0 || strpos($data_start, '<html>') === 0) {
                $html5->loadHTML($data);
            } else {
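                // $encoding is not in scope inside this method; use the document's own encoding instead.
                @$this->loadHTML('<!DOCTYPE html><html><head><meta charset="' . ($this->encoding ?: 'UTF-8') . '" /></head><body></body></html>');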
                $t = $html5->loadHTMLFragment($data);
                $docbody = $this->getElementsByTagName('body')->item(0);
                while ($t->hasChildNodes()) {
                    $docbody->appendChild($t->firstChild);
                }
            }
        }
    }

    [...]
}

See also Parsing XML documents with CSS selectors by Symfony's creator Fabien Potencier on his decision to create the CssSelector component for Symfony and how to use it.

二智少女 2024-09-25 00:33:16

The Symfony framework has bundles that can parse HTML, and you can use CSS-style selectors to pick DOM elements instead of using XPath.

他夏了夏天 2024-09-25 00:33:16

With FluidXML you can query and iterate XML using XPath and CSS Selectors.

$doc = fluidxml('<html>...</html>');

$title = $doc->query('//head/title')[0]->nodeValue;

$doc->query('//body/p', 'div.active', '#bgId')
        ->each(function($i, $node) {
            // $node is a DOMNode.
            $tag   = $node->nodeName;
            $text  = $node->nodeValue;
            $class = $node->getAttribute('class');
        });

https://github.com/servo-php/fluidxml

一念一轮回 2024-09-25 00:33:16

JSON and array from XML in three lines:

$xml = simplexml_load_string($xml_string);
$json = json_encode($xml);
$array = json_decode($json,TRUE);

Ta da!
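
The shape of the resulting array is worth knowing before relying on this shortcut. A small illustration with invented sample XML, showing where attributes end up and how repeated elements are represented:

```php
<?php
// Same three-step conversion as above, wrapped in a hypothetical helper.
function xmlToArray(string $xml_string): array
{
    $xml  = simplexml_load_string($xml_string);
    $json = json_encode($xml);
    return json_decode($json, true);
}

// An attribute, a single child and a repeated child.
$many = xmlToArray('<user id="7"><name>Ann</name><tag>a</tag><tag>b</tag></user>');

// The same element appearing only once.
$one = xmlToArray('<user><tag>a</tag></user>');

print_r($many);
print_r($one);
```

Attributes are collected under an "@attributes" key, a repeated element becomes a list, and an element that occurs once becomes a plain string. Code consuming the array therefore has to handle both shapes for tags that may repeat.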

凌乱心跳 2024-09-25 00:33:16

There are several reasons not to parse HTML with regular expressions. But if you have total control over what HTML will be generated, a simple regular expression can do.

Below is a function that parses HTML with a regular expression. Note that this function is very sensitive and demands that the HTML obey certain rules, but it works very well in many scenarios. If you want a simple parser and don't want to install libraries, give this a shot:

function array_combine_($keys, $values) {
    $result = array();
    foreach ($keys as $i => $k) {
        $result[$k][] = $values[$i];
    }
    // create_function() was removed in PHP 8.0; a closure does the same job.
    array_walk($result, function (&$v) { $v = (count($v) == 1) ? array_pop($v) : $v; });

    return $result;
}

function extract_data($str) {
    return (is_array($str))
        ? array_map('extract_data', $str)
        : ((!preg_match_all('#<([A-Za-z0-9_]*)[^>]*>(.*?)</\1>#s', $str, $matches))
            ? $str
            : array_map(('extract_data'), array_combine_($matches[1], $matches[2])));
}

print_r(extract_data(file_get_contents("http://www.google.com/")));

仅冇旳回忆 2024-09-25 00:33:16

I've created a library called HTML5DOMDocument that is freely available at https://github.com/ivopetkov/html5-dom-document-php

It also supports query selectors, which I think will be extremely helpful in your case. Here is some example code:

$dom = new IvoPetkov\HTML5DOMDocument();
$dom->loadHTML('<!DOCTYPE html><html><body><h1>Hello</h1><div class="content">This is some text</div></body></html>');
echo $dom->querySelector('h1')->innerHTML;

枕梦 2024-09-25 00:33:16

The best method for parsing XML:

// The URL is a placeholder. simplexml_load_file() accepts a path or URL,
// whereas simplexml_load_string() expects the XML text itself.
$url = 'http://www.example.com/rss.xml';
$rss = simplexml_load_file($url);
$i = 0;
foreach ($rss->channel->item as $feedItem) {
  $i++;
  echo $feedItem->title;
  echo '<br>';
  echo $feedItem->link;
  echo '<br>';
  // A missing <description> simply casts to an empty string.
  echo (string) $feedItem->description;
  echo '<br>';
  if ($i > 5) break;
}

紙鸢 2024-09-25 00:33:16

There are many ways:

In General:

  • Native XML Extensions: they come bundled with PHP, are usually faster than all the 3rd party libs, and give you all the control you need over the markup.

  • DOM: DOM is capable of parsing and modifying real-world (broken) HTML and it can do XPath queries. It is based on libxml.

  • XML Reader: XMLReader, like DOM, is based on libxml. The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

  • XML Parser: This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust. It implements a SAX style XML push parser.

  • Simple XML: The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

3rd Party Libraries [ libxml based ]:

  • FluentDom - Repo: FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. It can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

  • HtmlPageDom: is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

  • ZendDOM: Zend_Dom provides tools for working with DOM documents and structures. Currently, they offer Zend_Dom_Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

  • QueryPath: QueryPath is a PHP library for manipulating XML and HTML. It is designed to work not only with local files but also with web services and database resources. It implements much of the jQuery interface (including CSS-style selectors), but it is heavily tuned for server-side use. Can be installed via Composer.

  • fDOM Document: fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

  • Sabre/XML: sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

  • FluidXML: FluidXML is a PHP library for manipulating XML with a concise and fluent API. It leverages XPath and the fluent programming pattern to be fun and effective.

3rd Party Libraries [ Not libxml based ]:

  • PHP Simple HTML DOM Parser: An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way. It requires PHP 5+ and also supports invalid HTML.
    It extracts contents from HTML in a single line. The codebase is horrible and it is very slow.

  • PHP Html Parser: PHPHtmlParser is a simple, flexible HTML parser that allows you to select tags using any CSS selector, like jQuery. The goal is to assist in the development of tools that require a quick, easy way to scrape HTML, whether it's valid or not. It is slow and takes too much CPU power.

  • Ganon (recommended): A universal tokenizer and HTML/XML/RSS DOM parser. It can manipulate elements and their attributes, supports invalid HTML and UTF-8, and can perform advanced CSS3-like queries on elements (like jQuery, with namespaces supported). It includes an HTML beautifier (like HTML Tidy) and can minify CSS and JavaScript. It can sort attributes, change character case, correct indentation, etc.
    It is extensible: operations are separated into smaller functions for easy overriding.
    It is fast and easy to use.

Web Services:

  • If you don't feel like programming PHP, you can also use Web services. ScraperWiki's external interface allows you to extract data in the form you want for use on the web or in your own applications. You can also extract information about the state of any scraper.

I have shared all the resources; you can choose according to your taste, usefulness, etc.

静待花开 2024-09-25 00:33:15

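Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real-world (broken) HTML and it can do XPath queries. It is based on libxml.

It takes some time to get productive with DOM, but that time is well worth it in my opinion. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API.

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching or browsing Stack Overflow.

A basic usage example and a general conceptual overview are available in other answers.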
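
As a small illustration of DOM reading and modifying broken markup, here is a sketch with an invented HTML fragment and a placeholder base URL. It is one possible usage pattern, not the only way to work with the extension:

```php
<?php
// Invented fragment with unclosed <p> tags, as often found in the wild.
$html = '<div class="post"><p>First<p>Second <a href="/about">about</a></div>';

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);              // uses libxml's HTML parser, which repairs the markup
libxml_clear_errors();

// Read: collect the text of every paragraph.
$paragraphs = [];
foreach ($doc->getElementsByTagName('p') as $p) {
    $paragraphs[] = trim($p->textContent);
}

// Modify: turn root-relative links into absolute ones (base URL is a placeholder).
foreach ($doc->getElementsByTagName('a') as $a) {
    $href = $a->getAttribute('href');
    if (strpos($href, '/') === 0) {
        $a->setAttribute('href', 'https://example.com' . $href);
    }
}

$firstLink = $doc->getElementsByTagName('a')->item(0)->getAttribute('href');

echo $doc->saveHTML();
```

The parser closes the first paragraph when it meets the second one, so the loop sees two separate p elements even though the source never closed either.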

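XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML parser module, so chances are that using XMLReader for parsing broken HTML is less robust than using DOM, where you can explicitly tell it to use libxml's HTML parser module.

A basic usage example is available in another answer.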

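XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX-style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.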
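
To give a feel for the push model, here is a minimal sketch that collects the text of every title element. The sample XML and variable names are invented; note how the handlers have to track state themselves, which is the main reason this style is harder to work with:

```php
<?php
$xml = '<library><book><title>Dune</title></book><book><title>Emma</title></book></library>';

$titles  = [];
$inTitle = false;
$buffer  = '';

$parser = xml_parser_create('UTF-8');
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0); // keep tag names as written

xml_set_element_handler(
    $parser,
    // Called for every opening tag.
    function ($parser, string $name, array $attrs) use (&$inTitle, &$buffer) {
        if ($name === 'title') {
            $inTitle = true;
            $buffer  = '';
        }
    },
    // Called for every closing tag.
    function ($parser, string $name) use (&$inTitle, &$buffer, &$titles) {
        if ($name === 'title') {
            $titles[] = $buffer;
            $inTitle  = false;
        }
    }
);

xml_set_character_data_handler(
    $parser,
    // Text may arrive in several chunks, so append rather than assign.
    function ($parser, string $data) use (&$inTitle, &$buffer) {
        if ($inTitle) {
            $buffer .= $data;
        }
    }
);

if (!xml_parse($parser, $xml, true)) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($titles);
```

For large inputs, xml_parse() can be called repeatedly with successive chunks and the final flag set only on the last one, so the whole document never has to be in memory.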

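SimpleXml

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.

A basic usage example is available, and there are lots of additional examples in the PHP Manual.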
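
A short sketch of the property and array access mentioned above, using invented order data. Child elements are read as properties and attributes as array keys:

```php
<?php
$xml = <<<XML
<orders>
  <order id="1001"><customer>Ann</customer><total currency="EUR">19.90</total></order>
  <order id="1002"><customer>Bob</customer><total currency="USD">5.00</total></order>
</orders>
XML;

$orders = simplexml_load_string($xml);

$rows = [];
foreach ($orders->order as $order) {
    // sprintf's %s casts each SimpleXMLElement to its string value.
    $rows[] = sprintf(
        '%s: %s %s %s',
        $order['id'],                 // attribute of <order>
        $order->customer,             // child element
        $order->total,                // child element
        $order->total['currency']     // attribute of the child
    );
}

print_r($rows);
```

Values stay SimpleXMLElement objects until they are cast, so an explicit (string) cast is advisable before comparing them or storing them elsewhere.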


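3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDom

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on the jQuery JavaScript library.
The library is written in PHP5 and provides an additional Command Line Interface (CLI).

This is described as "abandonware and buggy: use at your own risk", but it appears to be minimally maintained.

laminas-dom

The Laminas\Dom component (formerly Zend_DOM) provides tools for working with DOM documents and structures. Currently, we offer Laminas\Dom\Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

This package is considered feature-complete, and is now in security-only maintenance mode.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API.
It leverages XPath and the fluent programming pattern to be fun and effective.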


3rd-Party(不是基于 libxml)

基于 DOM/libxml 构建的好处是,您可以立即获得良好的性能,因为您基于本机扩展。然而,并非所有第三方库都走这条路。下面列出了其中一些

PHP 简单 HTML DOM 解析器

  • 用 PHP5+ 编写的 HTML DOM 解析器可让您以非常简单的方式操作 HTML!
  • 需要 PHP 5+。
  • 支持无效 HTML。
  • 使用选择器在 HTML 页面上查找标签,就像 jQuery 一样。
  • 在一行中从 HTML 中提取内容。

我一般不推荐这个解析器。代码库很糟糕,解析器本身相当慢并且占用内存。并非所有 jQuery 选择器(例如 子选择器)都是可行的。任何基于 libxml 的库都应该轻松超越这一点。

PHP Html 解析器

PHPHtmlParser 是一个简单、灵活的 html 解析器,它允许您使用任何 css 选择器(例如 jQuery)来选择标签。我们的目标是协助开发需要快速、简单的方法来抓取 html 的工具,无论它是否有效!这个项目最初是由 sunra/php-simple-html-dom-parser 支持的,但是支持似乎已经停止了,所以这个项目是我对他之前工作的改编。

再说一次,我不会推荐这个解析器。 CPU 使用率高时速度相当慢。也没有清除创建的 DOM 对象内存的功能。这些问题在嵌套循环中尤其严重。文档本身不准确且拼写错误,自 2016 年 4 月 14 日以来没有对修复做出任何响应。


HTML 5

您可以使用上述内容来解析 HTML5,但是可以由于 HTML5 允许的标记,这可能是怪癖。因此,对于 HTML5,您可能需要考虑使用专用解析器。请注意,这些是用 PHP 编写的,因此与使用较低级别语言编译的扩展相比,性能较慢且内存使用量增加。

HTML5DomDocument

HTML5DOMDocument 扩展了本机 DOMDocument 库。它修复了一些错误并添加了一些新功能。

  • 保留 html 实体(DOMDocument 不保留)
  • 保留 void 标签(DOMDocument 不保留)
  • 允许插入 HTML 代码,将正确的部分移动到正确的位置(head 元素插入到 head 中,body 元素插入到 body 中)
  • 允许使用 CSS 选择器查询 DOM(当前可用:*tagnametagname#id#id, <代码>tagname.classname, .classname, tagname.classname.classname2, .classname.classname2, <代码>标签名[属性选择器], [属性选择器], div, p, div p, div > pdiv + pp ~ ul。)
  • 添加了对 element->classList 的支持。
  • 添加了对 element->innerHTML 的支持。
  • 添加了对 element->outerHTML 的支持。

HTML5

HTML5 是完全用 PHP 编写的符合标准的 HTML5 解析器和编写器。它很稳定,并在许多生产网站中使用,下载量远远超过 500 万次。

HTML5 提供以下功能。

  • HTML5 序列化器
  • 支持 PHP 命名空间
  • 作曲家支持
  • 基于事件(类似 SAX)的解析器
  • DOM 树构建器
  • 与 QueryPath 的互操作性
  • 在 PHP 5.3.0 或更高版本上运行

正则表达式

最后也是最不推荐,您可以提取使用正则表达式从 HTML 获取数据。一般来说,不鼓励在 HTML 上使用正则表达式。

您在网络上找到的大多数用于匹配标记的片段都很脆弱。在大多数情况下,它们仅适用于非常特定的 HTML 片段。微小的标记更改(例如在某处添加空格,或者在标记中添加或更改属性)可能会导致正则表达式在编写不正确时失败。在 HTML 上使用 RegEx 之前,您应该知道自己在做什么。

HTML 解析器已经知道 HTML 的语法规则。必须为您编写的每个新正则表达式教授正则表达式。正则表达式在某些情况下很好,但这实际上取决于您的用例。

可以编写更可靠的解析器,但是使用正则表达式编写完整且可靠的自定义解析器是一种浪费当上述库已经存在并且在这方面做得更好的时候。

另请参阅克苏鲁方式解析 Html


书籍

如果你想花点钱,可以看看

我不隶属于 PHP 架构师或作者。

Native XML Extensions

I prefer using one of the native XML extensions since they come bundled with PHP, are usually faster than all the 3rd party libs and give me all the control I need over the markup.

DOM

The DOM extension allows you to operate on XML documents through the DOM API with PHP 5. It is an implementation of the W3C's Document Object Model Core Level 3, a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents.

DOM is capable of parsing and modifying real world (broken) HTML and it can do XPath queries. It is based on libxml.

It takes some time to get productive with DOM, but that time is well worth it IMO. Since DOM is a language-agnostic interface, you'll find implementations in many languages, so if you need to change your programming language, chances are you will already know how to use that language's DOM API then.

How to use the DOM extension has been covered extensively on Stack Overflow, so if you choose to use it, you can be sure most of the issues you run into can be solved by searching or browsing the site.

A basic usage example and a general conceptual overview are available in other answers.
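
As a minimal sketch of the DOM workflow described above (the markup and variable names are made up for illustration), the following loads a sloppy HTML fragment with DOMDocument::loadHTML() and collects link targets with an XPath query:

```php
<?php
// Illustrative input: the <li> tags are never closed.
$html = '<ul><li><a href="/a">A</a><li><a href="/b">B</a></ul>';

// Collect libxml complaints internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);   // uses libxml's HTML parser module

libxml_clear_errors();
libxml_use_internal_errors($previous);

// Query the resulting tree with XPath: every <a> that has an href attribute.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $links[$a->textContent] = $a->getAttribute('href');
}

print_r($links);
```

The same DOMXPath object can be reused for further queries on the document, which is usually cheaper than re-parsing the markup.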

XMLReader

The XMLReader extension is an XML pull parser. The reader acts as a cursor going forward on the document stream and stopping at each node on the way.

XMLReader, like DOM, is based on libxml. I am not aware of how to trigger the HTML Parser Module, so chances are using XMLReader for parsing broken HTML might be less robust than using DOM where you can explicitly tell it to use libxml's HTML Parser Module.

A basic usage example is available in another answer.
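
To illustrate the cursor behaviour, here is a small sketch (the XML snippet is invented for the example) that walks forward through a document and picks up the text of every title element without building a full tree:

```php
<?php
// Illustrative input; a real use case would typically be a large file opened with open().
$xml = '<books><book id="1"><title>A</title></book><book id="2"><title>B</title></book></books>';

$reader = new XMLReader();
$reader->XML($xml);

$titles = [];
// read() advances the cursor to the next node and returns false at the end.
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'title') {
        // readString() returns the text content of the current element.
        $titles[] = $reader->readString();
    }
}
$reader->close();

print_r($titles);
```

Because only the current node is held in memory, this style tends to suit large documents better than loading everything into DOM.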

XML Parser

This extension lets you create XML parsers and then define handlers for different XML events. Each XML parser also has a few parameters you can adjust.

The XML Parser library is also based on libxml, and implements a SAX style XML push parser. It may be a better choice for memory management than DOM or SimpleXML, but will be more difficult to work with than the pull parser implemented by XMLReader.
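
A rough sketch of the push style, assuming a made-up document with item elements: the parser calls the registered handlers as it encounters start tags, end tags, and character data, and the script has to keep its own state between calls.

```php
<?php
// Illustrative input.
$xml = '<items><item>first</item><item>second</item></items>';

$items  = [];
$buffer = null;   // non-null while we are inside an <item>

$parser = xml_parser_create();
// Keep element names as written instead of upper-casing them.
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($parser, string $name, array $attrs) use (&$buffer) {
        if ($name === 'item') {
            $buffer = '';
        }
    },
    function ($parser, string $name) use (&$items, &$buffer) {
        if ($name === 'item') {
            $items[] = $buffer;
            $buffer  = null;
        }
    }
);

// Character data may arrive in several chunks, so append rather than assign.
xml_set_character_data_handler($parser, function ($parser, string $data) use (&$buffer) {
    if ($buffer !== null) {
        $buffer .= $data;
    }
});

if (xml_parse($parser, $xml, true) !== 1) {
    throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
}

print_r($items);
```

The bookkeeping in $buffer is the part that makes this harder to work with than a pull parser, where the calling code decides when to advance.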

SimpleXml

The SimpleXML extension provides a very simple and easily usable toolset to convert XML to an object that can be processed with normal property selectors and array iterators.

SimpleXML is an option when you know the HTML is valid XHTML. If you need to parse broken HTML, don't even consider SimpleXml because it will choke.

A basic usage example is available, and there are lots of additional examples in the PHP Manual.
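
For well-formed input, a minimal SimpleXML sketch could look like this (the catalog markup is invented for the example); child elements are read as properties and attributes with array syntax:

```php
<?php
// Illustrative, well-formed input. SimpleXML will not repair broken markup.
$xml = '<catalog><book id="b1"><title>One</title></book><book id="b2"><title>Two</title></book></catalog>';

$catalog = simplexml_load_string($xml);
if ($catalog === false) {
    throw new RuntimeException('Input is not well-formed XML');
}

$titlesById = [];
foreach ($catalog->book as $book) {
    // Cast to string: properties and attributes are SimpleXMLElement objects.
    $titlesById[(string) $book['id']] = (string) $book->title;
}

print_r($titlesById);
```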


3rd Party Libraries (libxml based)

If you prefer to use a 3rd-party lib, I'd suggest using a lib that actually uses DOM/libxml underneath instead of string parsing.

FluentDom

FluentDOM provides a jQuery-like fluent XML interface for the DOMDocument in PHP. Selectors are written in XPath or CSS (using a CSS to XPath converter). Current versions extend the DOM implementing standard interfaces and add features from the DOM Living Standard. FluentDOM can load formats like JSON, CSV, JsonML, RabbitFish and others. Can be installed via Composer.

HtmlPageDom

Wa72\HtmlPageDom is a PHP library for easy manipulation of HTML documents using DOM. It requires DomCrawler from Symfony2 components for traversing the DOM tree and extends it by adding methods for manipulating the DOM tree of HTML documents.

phpQuery

phpQuery is a server-side, chainable, CSS3 selector driven Document Object Model (DOM) API based on jQuery JavaScript Library.
The library is written in PHP5 and provides additional Command Line Interface (CLI).

This is described as "abandonware and buggy: use at your own risk" but does appear to be minimally maintained.

laminas-dom

The Laminas\Dom component (formerly Zend_DOM) provides tools for working with DOM documents and structures. Currently, we offer Laminas\Dom\Query, which provides a unified interface for querying DOM documents utilizing both XPath and CSS selectors.

This package is considered feature-complete, and is now in security-only maintenance mode.

fDOMDocument

fDOMDocument extends the standard DOM to use exceptions at all occasions of errors instead of PHP warnings or notices. They also add various custom methods and shortcuts for convenience and to simplify the usage of DOM.

sabre/xml

sabre/xml is a library that wraps and extends the XMLReader and XMLWriter classes to create a simple "xml to object/array" mapping system and design pattern. Writing and reading XML is single-pass and can therefore be fast and require low memory on large xml files.

FluidXML

FluidXML is a PHP library for manipulating XML with a concise and fluent API.
It leverages XPath and the fluent programming pattern to be fun and effective.


3rd-Party (not libxml-based)

The benefit of building upon DOM/libxml is that you get good performance out of the box because you are based on a native extension. However, not all 3rd-party libs go down this route. Some of them are listed below.

PHP Simple HTML DOM Parser

  • An HTML DOM parser written in PHP5+ that lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.

I generally do not recommend this parser. The codebase is horrible and the parser itself is rather slow and memory hungry. Not all jQuery Selectors (such as child selectors) are possible. Any of the libxml based libraries should outperform this easily.

PHP Html Parser

PHPHtmlParser is a simple, flexible, html parser which allows you to select tags using any css selector, like jQuery. The goal is to assist in the development of tools which require a quick, easy way to scrape html, whether it's valid or not! This project was originally supported by sunra/php-simple-html-dom-parser but the support seems to have stopped so this project is my adaptation of his previous work.

Again, I would not recommend this parser. It is rather slow, with high CPU usage. There is also no function to clear the memory of created DOM objects. These problems get worse with nested loops. The documentation itself is inaccurate and misspelled, and there have been no responses to fixes since 14 April 2016.


HTML 5

You can use the above for parsing HTML5, but there can be quirks due to the markup HTML5 allows. So for HTML5 you may want to consider using a dedicated parser. Note that these are written in PHP, so they suffer from slower performance and increased memory usage compared to a compiled extension written in a lower-level language.

HTML5DomDocument

HTML5DOMDocument extends the native DOMDocument library. It fixes some bugs and adds some new functionality.

  • Preserves html entities (DOMDocument does not)
  • Preserves void tags (DOMDocument does not)
  • Allows inserting HTML code that moves the correct parts to their proper places (head elements are inserted in the head, body elements in the body)
  • Allows querying the DOM with CSS selectors (currently available: *, tagname, tagname#id, #id, tagname.classname, .classname, tagname.classname.classname2, .classname.classname2, tagname[attribute-selector], [attribute-selector], div, p, div p, div > p, div + p, and p ~ ul.)
  • Adds support for element->classList.
  • Adds support for element->innerHTML.
  • Adds support for element->outerHTML.

HTML5

HTML5 is a standards-compliant HTML5 parser and writer written entirely in PHP. It is stable and used in many production websites, and has well over five million downloads.

HTML5 provides the following features.

  • An HTML5 serializer
  • Support for PHP namespaces
  • Composer support
  • Event-based (SAX-like) parser
  • A DOM tree builder
  • Interoperability with QueryPath
  • Runs on PHP 5.3.0 or newer

Regular Expressions

Last and least recommended, you can extract data from HTML with regular expressions. In general using Regular Expressions on HTML is discouraged.

Most of the snippets you will find on the web to match markup are brittle. In most cases they only work for a very particular piece of HTML. Tiny markup changes, like adding whitespace somewhere, or adding or changing attributes in a tag, can make the RegEx fail when it's not properly written. You should know what you are doing before using RegEx on HTML.
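
A small sketch of that brittleness, using an invented pattern of the kind often copied from the web: the regex stops matching as soon as the attribute order and quoting change, while the DOM extension returns the same value for both variants. The helper name firstHref is hypothetical.

```php
<?php
// A typical copy-pasted pattern: assumes href comes first and is double-quoted.
$pattern = '/<a href="([^"]+)">/';

$original = '<a href="/one">One</a>';
$changed  = "<a class='nav'  href='/one'>One</a>";   // same link, slightly different markup

$regexOnOriginal = preg_match($pattern, $original, $m) === 1 ? $m[1] : null;
$regexOnChanged  = preg_match($pattern, $changed, $m) === 1 ? $m[1] : null;

// Hypothetical helper: read the first link's href through the DOM extension.
function firstHref(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $a = $doc->getElementsByTagName('a')->item(0);
    return $a ? $a->getAttribute('href') : null;
}

var_dump($regexOnOriginal, $regexOnChanged, firstHref($original), firstHref($changed));
```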

HTML parsers already know the syntactical rules of HTML. A regular expression has to be taught those rules anew each time you write one. RegEx are fine in some cases, but it really depends on your use case.

You can write more reliable parsers, but writing a complete and reliable custom parser with regular expressions is a waste of time when the aforementioned libraries already exist and do a much better job at this.

Also see Parsing Html The Cthulhu Way


Books

If you want to spend some money, have a look at

I am not affiliated with PHP Architect or the authors.

捎一片雪花 2024-09-25 00:33:15

Try Simple HTML DOM Parser.

  • An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
  • Requires PHP 5+.
  • Supports invalid HTML.
  • Find tags on an HTML page with selectors just like jQuery.
  • Extract contents from HTML in a single line.
  • Download

Note: as the name suggests, it can be useful for simple tasks. It uses regular expressions instead of an HTML parser, so it will be considerably slower for more complex tasks. The bulk of its codebase was written in 2008, with only small improvements made since then. It does not follow modern PHP coding standards and would be challenging to incorporate into a modern PSR-compliant project.

Examples:

How to get HTML elements:

// Assumes simple_html_dom.php from the library download is on the include path
require_once 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.example.com/');

// Find all images
foreach ($html->find('img') as $element) {
    echo $element->src . '<br>';
}

// Find all links
foreach ($html->find('a') as $element) {
    echo $element->href . '<br>';
}

How to modify HTML elements:

// Create DOM from string
$html = str_get_html('<div id="hello">Hello</div><div id="world">World</div>');

$html->find('div', 1)->class = 'bar';

$html->find('div[id=hello]', 0)->innertext = 'foo';

echo $html;

Extract content from HTML:

// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext;

Scraping Slashdot:

// Create DOM from URL
$html = file_get_html('http://slashdot.org/');

// Find all article blocks
$articles = [];
foreach ($html->find('div.article') as $article) {
    $item = [];  // start fresh so values do not leak between iterations
    $item['title']   = $article->find('div.title', 0)->plaintext;
    $item['intro']   = $article->find('div.intro', 0)->plaintext;
    $item['details'] = $article->find('div.details', 0)->plaintext;
    $articles[] = $item;
}

print_r($articles);
走野 2024-09-25 00:33:15

Just use DOMDocument->loadHTML() and be done with it. libxml's HTML parsing algorithm is quite good and fast, and contrary to popular belief, does not choke on malformed HTML.
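
A quick way to see this for yourself is a sketch like the following, which feeds deliberately malformed markup (invented for the example) to loadHTML() and then reads from the resulting tree. The parser's complaints are collected with libxml_use_internal_errors() rather than printed as warnings.

```php
<?php
// Illustrative malformed input: unquoted attribute, unclosed <b>, stray </div>.
$broken = '<div class=box><p>Hello <b>world</p></div></div>';

libxml_use_internal_errors(true);   // keep parser complaints out of the output

$doc = new DOMDocument();
$ok  = $doc->loadHTML($broken);

// The parser records what it had to recover from; inspect or discard as needed.
$problems = libxml_get_errors();
libxml_clear_errors();

// Despite the errors, a usable tree exists.
$bold = $doc->getElementsByTagName('b')->item(0);

echo $ok ? "parsed\n" : "failed\n";
echo $bold->textContent, "\n";
echo count($problems), " problem(s) reported\n";
```

How many problems are reported depends on the libxml version, so the count is best treated as diagnostic information only.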

野却迷人 2024-09-25 00:33:15

Why shouldn't you, and when should you, use regular expressions?

First off, a common misnomer: regexps are not for "parsing" HTML. Regexes can, however, "extract" data. Extracting is what they're made for. The major drawbacks of regex HTML extraction compared with proper SGML toolkits or baseline XML parsers are the syntactic effort and the varying reliability.

Consider that a somewhat dependable HTML extraction regex:

<a\s+class="?playbutton\d?[^>]+id="(\d+)".+?    <a\s+class="[\w\s]*title
[\w\s]*"[^>]+href="(http://[^">]+)"[^>]*>([^<>]+)</a>.+?

is way less readable than a simple phpQuery or QueryPath equivalent:

$div->find(".stationcool a")->attr("title");

There are however specific use cases where they can help.

  • Many DOM traversal frontends don't reveal HTML comments <!--, which however are sometimes the more useful anchors for extraction. In particular pseudo-HTML variations <$var> or SGML residues are easy to tame with regexps.
  • Oftentimes regular expressions can save post-processing. However HTML entities often require manual caretaking.
  • And lastly, for extremely simple tasks like extracting <img src= urls, they are in fact a viable tool. The speed advantage over SGML/XML parsers mostly just comes into play for these very basic extraction procedures.

It's sometimes even advisable to pre-extract a snippet of HTML using regular expressions /<!--CONTENT-->(.+?)<!--END-->/ and process the remainder using the simpler HTML parser frontends.
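
A sketch of that two-step approach, assuming a page that really does carry such marker comments (the markup here is invented): the regex only cuts out the region of interest, and the DOM extension does the actual parsing. Note the added "s" modifier, without which "." would not match across line breaks.

```php
<?php
// Illustrative page with marker comments around the interesting part.
$page = <<<HTML
<html><body>
<div id="nav"><a href="/home">Home</a></div>
<!--CONTENT-->
<p>First <a href="/story">story</a></p>
<!--END-->
<div id="footer"><a href="/imprint">Imprint</a></div>
</body></html>
HTML;

$links = [];

// Step 1: cut out the region between the markers.
if (preg_match('/<!--CONTENT-->(.+?)<!--END-->/s', $page, $m) === 1) {
    // Step 2: hand only that fragment to the DOM parser.
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($m[1]);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $a) {
        $links[] = $a->getAttribute('href');
    }
}

print_r($links);
```

Navigation and footer links never reach the parser, so the later selectors can stay simple.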

Note: I actually have this app, where I employ XML parsing and regular expressions as alternatives to each other. Just last week the PyQuery parsing broke, and the regex still worked. Yes, weird, and I can't explain it myself. But so it happened.
So please don't vote real-world considerations down just because they don't match the regex=evil meme. But let's also not vote this up too much. It's just a sidenote for this topic.

神妖 2024-09-25 00:33:15

Note, this answer recommends libraries that have now been abandoned for 10+ years.

phpQuery and QueryPath are extremely similar in replicating the fluent jQuery API. That's also why they're two of the easiest approaches to properly parse HTML in PHP.

Examples for QueryPath

Basically you first create a queryable DOM tree from an HTML string:

 $qp = qp("<html><body><h1>title</h1>..."); // or give filename or URL

The resulting object contains a complete tree representation of the HTML document. It can be traversed using DOM methods. But the common approach is to use CSS selectors like in jQuery:

 $qp->find("div.classname")->children()->...;

 foreach ($qp->find("p img") as $img) {
     print qp($img)->attr("src");
 }

Mostly you want to use simple #id and .class or DIV tag selectors for ->find(). But you can also use XPath statements, which sometimes are faster. Also typical jQuery methods like ->children() and ->text() and particularly ->attr() simplify extracting the right HTML snippets. (And already have their SGML entities decoded.)

 $qp->xpath("//div/p[1]");  // get first paragraph in a div
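
For readers without QueryPath installed, the following dependency-free sketch shows roughly what simple #id and .class selectors and the XPath statement above correspond to when run through PHP's native DOMXPath. This is not QueryPath's API, and the markup is invented for the example.

```php
<?php
$html = '<div id="main"><p class="note first">One</p><p class="footnote">Two</p><p class="note">Three</p></div>';

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Roughly "#main": match on the id attribute.
$main = $xpath->query('//*[@id="main"]')->item(0);

// Roughly ".note": match a whole class token, so "footnote" is not picked up.
$classQuery = '//*[contains(concat(" ", normalize-space(@class), " "), " note ")]';
$notes = [];
foreach ($xpath->query($classQuery) as $node) {
    $notes[] = $node->textContent;
}

// "//div/p[1]": the first <p> child of each <div>.
$first = $xpath->query('//div/p[1]')->item(0)->textContent;

var_dump($main->nodeName, $notes, $first);
```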

QueryPath also allows injecting new tags into the stream (->append), and later output and prettify an updated document (->writeHTML). It can not only parse malformed HTML, but also various XML dialects (with namespaces), and even extract data from HTML microformats (XFN, vCard).

 $qp->find("a[target=_blank]")->toggleClass("usability-blunder");

phpQuery or QueryPath?

Generally QueryPath is better suited for manipulation of documents, while phpQuery also implements some pseudo-AJAX methods (just HTTP requests) to more closely resemble jQuery. It is said that phpQuery is often faster than QueryPath (because of fewer overall features).

For further information on the differences see this comparison on the wayback machine from tagbyte.org. (Original source went missing, so here's an internet archive link. Yes, you can still locate missing pages, people.)

Advantages

  • Simplicity and Reliability
  • Simple to use alternatives ->find("a img, a object, div a")
  • Proper data unescaping (in comparison to regular expression grepping)
烟织青萝梦 2024-09-25 00:33:15

Simple HTML DOM is a great open-source parser:

simplehtmldom.sourceforge

It treats DOM elements in an object-oriented way, and the new iteration has a lot of coverage for non-compliant code. There are also some great functions like you'd see in JavaScript, such as the "find" function, which will return all instances of elements of that tag name.

I've used this in a number of tools, testing it on many different types of web pages, and I think it works great.
