Accessing child divs with DOMDocument and XPath

Posted on 2024-09-19 18:55:14

I'm building a basic screen scraper for personal use and learning purposes, so please do not post comments like "You need to ask permission" etc.

The data I'm trying to access is structured as follows:

<tr>
    <td>
        <div class="wrapper">
            <div class="randomDiv">
                <div class="divContent">
                    <div class="event">asd</div>
                    <div class="date">asd</div>
                    <div class="venue">asd</div>
                    <div class="state">asd</div>
                </div>
            </div>
        </div>
    </td>
</tr>

I'm attempting to gather all this data (as there are about 20 rows on the given page).

Using the following code I have managed to gather the data I need:

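// "linktoURL" is a placeholder for the real page address
$remote = file_get_contents("linktoURL");

// Collect libxml warnings about malformed HTML instead of silencing them with @
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
$doc->loadHTML($remote);
libxml_clear_errors();

$xp = new DOMXPath($doc);

// Result array, still to be filled
$rows = array();

// Print the combined text of every element whose class contains "wrapper"
foreach ($xp->query("//*[contains(@class, 'wrapper')]") as $found) {
    echo "<pre>", htmlspecialchars($found->nodeValue), "</pre>";
}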

My question is: how would I store all this data in an array of associative arrays like the one below?

Array (
    [0] => Array
        (
            [Event] => Name
            [Date] => 12/12/12
            [Venue] => NameOfPlace
            [state] => state
        )

    [1] => Array
        (
            [Event] => Name
            [Date] => 12/12/12
            [Venue] => NameOfPlace
            [state] => state
        )

    [2] => Array
        (
            [Event] => Name
            [Date] => 12/12/12
            [Venue] => NameOfPlace
            [state] => state
        )

)

Right now, the only solution that comes to mind is to run a separate XPath query for each class name, //*[contains(@class, 'className')], inside the foreach loop.
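
To make that concrete, here is a rough sketch of the per-class approach I have in mind, run against inline sample markup rather than the real page. The names `$html` and `$fields` are placeholders made up for illustration, and the sketch assumes each wrapper contains at most one div per class name:

```php
<?php
// Sample markup standing in for the remote page (an assumption for illustration).
$html = <<<HTML
<table>
<tr><td><div class="wrapper"><div class="randomDiv"><div class="divContent">
    <div class="event">Name</div>
    <div class="date">12/12/12</div>
    <div class="venue">NameOfPlace</div>
    <div class="state">state</div>
</div></div></div></td></tr>
<tr><td><div class="wrapper"><div class="randomDiv"><div class="divContent">
    <div class="event">Other</div>
    <div class="date">01/01/13</div>
    <div class="venue">OtherPlace</div>
    <div class="state">other</div>
</div></div></div></td></tr>
</table>
HTML;

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
$fields = ['event', 'date', 'venue', 'state'];  // hypothetical list of class names to collect
$rows = [];

foreach ($xp->query("//div[contains(@class, 'wrapper')]") as $wrapper) {
    $row = [];
    foreach ($fields as $field) {
        // The leading "." makes the query relative to the current wrapper node.
        $node = $xp->query(".//div[contains(@class, '$field')]", $wrapper)->item(0);
        $row[ucfirst($field)] = $node ? trim($node->textContent) : null;
    }
    $rows[] = $row;
}

print_r($rows);
```

The leading `.` in the inner query keeps it relative to the current wrapper; without it, `//div[...]` would search the whole document again on every iteration. Note that `ucfirst` capitalizes every key here, so the last one comes out as `State` rather than `state`.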

Is there a more idiomatic way, using DOMDocument and XPath, to build an associative array from the above data?

Edit:

I'm not limited to using DOMDocument and XPath; if there are other solutions that might be easier, please post them.
