In PHP, how do I search a DOMDocument for a text pattern and then get the parent element of the matching text node?

Posted 2024-10-20 17:25:32

I've built a simple web scraping utility with PHP and cURL, and have been using code like this to grab certain elements of the scraped page by ID, or by Tag Name where no ID is present on the desired element:

$dom = new DOMDocument();
@$dom->loadHTML($response);
$table = $dom->getElementsByTagName('table')->item(4);
$response = $dom->saveXML($table);

Now I've run into a dilemma where I need to go one step further and find the parent element of a certain string or regex pattern of text. The site I need to collect data from doesn't have any IDs or classes on the HTML elements I need to extract data from, and various pages may organize the data in different ways, so I can't always rely on the data being in table #X. The only sure-fire way to get the data I'm after is to look for it by its text format, which is always going to be a numeric list starting with "1. ". They don't use ordered lists either, or it would be much simpler. It's just a simple table cell with numbered lines separated by a simple <br>.

So I was thinking, if I could find the "1. " then its parent element would be the table cell <td>. After finding it, I would need to extract its content and perhaps the content of any other adjacent table cells in that table row. There are no other instances of "1. " that I could find in the page or the HTML code, so this approach seems reasonable, if a bit hacky, but I digress.

So, what's the best way to approach something like this?


Comments (1)

烟若柳尘 2024-10-27 17:25:32

You could always try an XPath query like the following (assuming the content you're after is always in a table cell):

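$xpath = new DOMXPath($dom);
// Match every <td> whose text content contains "1. ".
// Note: this path only matches <tr> elements that are direct children of
// <table>; rows wrapped in <tbody> would need '//table//td[...]' instead.
$cells = $xpath->query('//table/tr/td[contains(.,"1. ")]');
if ($cells->length > 0) {
    // get first item
    $cell = $cells->item(0);
    echo $cell->nodeValue; // text content only
    echo $dom->saveXML($cell); // <td>1. ... </td>
}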
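
If you need the regex check and the neighbouring cells that the question mentions, something along these lines might work. It is only a sketch: the sample markup in $html is made up for illustration, and the regex assumes the list item starts its text node with "1. ". It selects text nodes rather than cells, filters them with preg_match(), walks up to the nearest <td>, and then collects every <td> in the same row.

```php
<?php
// Hypothetical sample markup standing in for the scraped page.
$html = <<<HTML
<html><body>
<table><tr><td>Header</td></tr></table>
<table><tbody><tr><td>Steps</td><td>1. First<br>2. Second<br>3. Third</td><td>Notes</td></tr></tbody></table>
</body></html>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of using @
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select text nodes (not elements) containing "1. ", wherever they sit.
$textNodes = $xpath->query('//text()[contains(., "1. ")]');

$cell = null;
foreach ($textNodes as $textNode) {
    // Stricter regex check: the text must begin with "1. ",
    // which rules out things like "11. " or "see item 1. above".
    if (!preg_match('/^\s*1\.\s/', $textNode->nodeValue)) {
        continue;
    }
    // Walk up the ancestors until the enclosing <td> is reached.
    for ($node = $textNode->parentNode; $node !== null; $node = $node->parentNode) {
        if ($node->nodeName === 'td') {
            $cell = $node;
            break 2; // stop at the first matching cell
        }
    }
}

// Collect the text of every <td> in the same row, including the match.
$rowCells = [];
if ($cell !== null) {
    foreach ($cell->parentNode->childNodes as $sibling) {
        if ($sibling->nodeName === 'td') {
            $rowCells[] = trim($sibling->textContent);
        }
    }
}
```

Because it starts from the text node, this does not depend on which table the data is in or on whether the rows are wrapped in <tbody>. If the list text is nested inside other inline elements within the cell, the ancestor walk still ends at the nearest enclosing <td>.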