What is the best way to extract data from an HTML file using PHP DOM functions?

Posted 2024-10-26 06:23:05 · 305 characters · 2 views · 0 comments

I need to extract large amounts of data from a variety of HTML files, and I will have to write a separate script for each type of HTML file in order to parse out the data I need correctly.

The data will be located in different parts of the document. For example, in document type one, the data I need may be nicely inside a DIV with an ID, but in document type two the only way to locate the data may be by finding a certain pattern of tags that contains it (like <div><b>DATA</b></div>).

From the little I've been able to find so far, it seems that DOMXPath may be able to help me with at least some of the extraction. What other functions can I use, specifically for the second example of locating an arbitrary pattern of tags and getting its content?

Comments (2)

长不大的小祸害 2024-11-02 06:23:05

If you are extracting different types of data from a variety of HTML files, you will quickly tire of using the DOMDocument API and XPath directly. Use one of the wrapper libraries listed in "How do you parse and process HTML/XML in PHP?". They provide a richer API and additional selectors.

I prefer phpQuery and QueryPath, which allow for:

print qp($url)->find("body p.article a")->attr("href");

print qp($html)->find("div b")->text();

The usable functions are documented here: http://api.querypath.org/docs/class_query_path.html - it's mostly like jQuery.
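
For comparison, here is a minimal sketch of the two lookups from the question using only the built-in DOMDocument and DOMXPath classes. The sample markup, the `price` ID, and the variable names are assumptions made for illustration. Note that the XPath `//div/b` matches a `<b>` that is a direct child of a `<div>`, whereas the CSS selector "div b" above matches any descendant (roughly `//div//b`).

```php
<?php
// Hypothetical sample markup; real documents will differ.
$html = <<<'HTML'
<html><body>
  <div id="price">19.99</div>
  <div><b>DATA</b></div>
  <p><b>not this one</b></p>
</body></html>
HTML;

// Collect libxml warnings internally instead of emitting them,
// since real-world HTML is often malformed.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Case 1: the data sits in an element with a known ID.
$byId  = $xpath->query('//div[@id="price"]')->item(0);
$price = $byId !== null ? trim($byId->textContent) : null;

// Case 2: the data is only identifiable by a tag pattern,
// here a <b> that is a direct child of a <div>.
$matches = [];
foreach ($xpath->query('//div/b') as $node) {
    $matches[] = trim($node->textContent);
}

echo $price, "\n";
echo implode(', ', $matches), "\n";
```

Each document type would get its own set of XPath expressions; the loading and querying code stays the same.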

回忆躺在深渊里 2024-11-02 06:23:05

If you plan on parsing many HTML files and you need to select or modify many elements of your HTML files, consider using a library.

I would recommend the library PHPPowertools/DOM-Query, which I wrote myself. It allows you to (1) load an HTML file and then (2) select or change parts of your HTML in much the same way you would with jQuery in a frontend app.

Example use:

// Select the body tag
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Chain your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function($i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a descendant selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function($i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer within two new selectors
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

[...]
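
For readers who would rather stay with the built-in DOM extension, a class-based selector such as '.site-footer > .site-center' can be approximated in XPath. The sketch below is not part of DOM-Query; the markup and the `hasClass` helper are hypothetical and exist only to illustrate the idea.

```php
<?php
// Hypothetical markup mirroring the class names used in the example above.
$html = '<html><body>'
      . '<div class="site-header">Header</div>'
      . '<div class="site-footer wide"><div class="site-center">Footer text</div></div>'
      . '</body></html>';

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

// Hypothetical helper: builds an XPath predicate that matches one class
// token inside a space-separated class attribute.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Rough equivalent of the CSS selector ".site-footer > .site-center".
$query = '//*[' . hasClass('site-footer') . ']/*[' . hasClass('site-center') . ']';
$footerCenter = $xpath->query($query)->item(0);

$footerText = null;
if ($footerCenter instanceof DOMElement) {
    // Set an attribute on the matched element, then read its text.
    $footerCenter->setAttribute('data-val', 'see');
    $footerText = $footerCenter->textContent;
    echo $doc->saveHTML($footerCenter), "\n";
}
echo $footerText, "\n";
```

The verbosity of the class predicate is a good illustration of why the wrapper libraries recommended in this thread are convenient when many such selectors are needed.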