
What is the best way to extract data from HTML files using PHP DOM functions?

Posted on 2024-10-26 06:23:05

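I need to extract a large amount of data from a variety of HTML files, and I have to write a separate script for each type of HTML file in order to parse out the data I need correctly.

The data sits in different parts of each document. In document type one, for example, the data I need may be conveniently located inside a DIV with an ID, but in document type two the only way to locate it may be to look for a specific pattern of tags that contains it (e.g. DATA wrapped in a particular sequence of tags).

From the little I have found so far, DOMXPath looks able to help with at least some of the extraction. What other functions can I use, particularly for the second case of locating an arbitrary pattern of tags and getting its contents?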


长不大的小祸害 2024-11-02 06:23:05

If you are extracting different types of data from a variety of HTML files, you will quickly tire of using the DOMDocument API and XPath directly. Use one of the wrapper libraries listed in "How do you parse and process HTML/XML in PHP?". They provide a richer API and additional selectors.

I prefer phpQuery and QueryPath, which allow for:

print qp($url)->find("body p.article a")->attr("href");

print qp($html)->find("div b")->text();

The available methods are documented here: http://api.querypath.org/docs/class_query_path.html. The API is mostly like jQuery.
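
For comparison, here is a minimal sketch of the plain DOMDocument/DOMXPath route the question mentions, covering both cases it describes. The sample markup, the id "price", the td > b > i pattern, and the helper name extractText() are all made up for illustration; the question does not say what the real documents look like.

```php
<?php
// Hypothetical sample documents; the markup and the id "price" are assumptions.
$typeOne = '<html><body><div id="price">42.50</div></body></html>';
$typeTwo = '<html><body><table><tr><td><b><i>DATA</i></b></td></tr></table></body></html>';

/**
 * Load an HTML string and return the trimmed text of every node matching $expr.
 * (extractText is an illustrative helper, not part of any library.)
 */
function extractText(string $html, string $expr): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($expr);
    if ($nodes === false) {
        // The XPath expression itself was malformed.
        return [];
    }

    $results = [];
    foreach ($nodes as $node) {
        $results[] = trim($node->textContent);
    }
    return $results;
}

// Case 1: the data sits in a DIV with a known id.
$byId = extractText($typeOne, '//div[@id="price"]');

// Case 2: the data is only identifiable by the tags around it (here: td > b > i).
$byPattern = extractText($typeTwo, '//td/b/i');

print_r($byId);
print_r($byPattern);
```

With this approach each document type only needs its own XPath expression rather than its own script, though the expressions get verbose quickly, which is where the wrapper libraries help.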


回忆躺在深渊里 2024-11-02 06:23:05

If you plan on parsing many HTML files and you need to select or modify many elements of your HTML files, consider using a library.

I would recommend the library PHPPowertools/DOM-Query, which I wrote myself. It allows you to (1) load an HTML file and then (2) select or change parts of your HTML in much the same way as you would with jQuery in a frontend app.

Example use:

// $H is assumed to be a DOM-Query object that has already loaded the HTML document
// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Chain your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function($i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a child selector (>) to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function($i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new wrapper elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

[...]
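
If adding a library is not an option, a selector such as '.site-footer > .site-center' can be approximated with the built-in DOMXPath class. The following is only a sketch: the sample markup and the hasClass() helper are assumptions made for illustration and have nothing to do with DOM-Query's internals.

```php
<?php
// Hypothetical markup modelled on the class names used in the example above.
$html = '<html><body>'
      . '<div class="site-body"><div class="site-center">body</div></div>'
      . '<div class="site-footer dark"><div class="site-center">footer</div></div>'
      . '</body></html>';

/** Build an XPath predicate that matches one whole class name in @class. */
function hasClass(string $class): string
{
    // Padding with spaces avoids matching "site-footer-old" when looking for "site-footer".
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

$doc = new DOMDocument();
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// XPath counterpart of the CSS selector ".site-footer > .site-center":
// "/" steps to direct children only, like ">" in CSS.
$expr = '//*[' . hasClass('site-footer') . ']/*[' . hasClass('site-center') . ']';
$footerCenters = $xpath->query($expr);

foreach ($footerCenters as $node) {
    // Plain-DOM counterpart of setting an attribute on every matched element.
    $node->setAttribute('data-val', 'see');
}

$count = $footerCenters->length;
$text  = $footerCenters->item(0)->textContent;
$attr  = $footerCenters->item(0)->getAttribute('data-val');
```

The class predicate is the verbose part. A CSS-style library hides it behind '.site-footer', which is most of the convenience it offers for this kind of selection.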
