Need help scraping a web page: getting specific content...

Posted on 2024-11-26 23:23:36

I have a table whose number of columns can change depending on the configuration of the scraped page (I have no control over it). I want to get only the information from a specific column, identified by its column heading.

Here is a simplified table:

<table>
<tbody>
<tr class='header'>
    <td>Image</td>
    <td>Name</td>
    <td>Time</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 1</td>
    <td>13:02</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 2</td>
    <td>13:43</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 3</td>
    <td>14:53</td>
</tr>
</tbody>
</table>

I only want to extract the names (column 2) from the table. However, as stated above, the column order cannot be known in advance. The Image column might not be there, for example, in which case the column I want would be the first one.

I was wondering if there's any way to do this with DomDocument/DomXPath. Perhaps search for the string "Name" in the first tr, find out which column index it has, and then use that index to get the info. A less elegant solution would be to check whether the first column contains an img tag, in which case the image column is first, so we can throw it away and use the next one.

I've been looking at it for about an hour and a half, but I'm not familiar with the DomDocument functions and DOM manipulation, and I'm having a lot of trouble with this one.

方圜几里 2024-12-03 23:23:36

Simple HTML DOM Parser may be useful here; its manual covers the details. Basically, you would use something like this:

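// Assumes simple_html_dom.php (Simple HTML DOM Parser) sits next to this script.
include 'simple_html_dom.php';

$url = "file url"; // placeholder: the URL or path of the page to scrape
$html = file_get_html($url);
$header = $html->find('tr.header td');
$num = null; // stays null if no 'Image' header is found
$i = 0;
foreach ($header as $element) {
    // Simple HTML DOM exposes lowercase properties; plaintext is the cell text without tags
    if (trim($element->plaintext) == 'Image') { $num = $i; }
    $i++;
}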

This finds which column ($num, zero-based) is the Image column. You can extend the code from there, for example by matching 'Name' instead to locate the column you actually want.
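
For the DomDocument/DomXPath route mentioned in the question, a minimal sketch might look like the following. It looks up the header cell by its text, converts its position into a 1-based XPath index, and then reads that cell from every non-header row. The helper name `columnValues()` and the inline sample markup are assumptions made for illustration; the sketch also assumes the header row carries `class='header'` exactly as in the simplified table.

```php
<?php
// Sample markup mirroring the simplified table from the question (illustrative data).
$html = <<<'HTML'
<table>
<tbody>
<tr class='header'>
    <td>Image</td>
    <td>Name</td>
    <td>Time</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 1</td>
    <td>13:02</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 2</td>
    <td>13:43</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 3</td>
    <td>14:53</td>
</tr>
</tbody>
</table>
HTML;

/**
 * Return the trimmed text of every cell in the column whose header equals $heading.
 * columnValues() is a hypothetical helper name, not part of any library.
 */
function columnValues(string $html, string $heading): array
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep parser warnings about loose markup quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($doc);

    // Step 1: find the 1-based position of the wanted header cell.
    $index = null;
    $position = 0;
    foreach ($xpath->query("//tr[@class='header']/td") as $cell) {
        $position++;
        if (trim($cell->textContent) === $heading) {
            $index = $position;
            break;
        }
    }
    if ($index === null) {
        return []; // heading not present on this page
    }

    // Step 2: read the cell at that position from every non-header row.
    $values = [];
    $query = sprintf("//tr[not(@class='header')]/td[%d]", $index);
    foreach ($xpath->query($query) as $cell) {
        $values[] = trim($cell->textContent);
    }
    return $values;
}

$names = columnValues($html, 'Name');
print_r($names);
```

With the sample table this collects `Name 1`, `Name 2` and `Name 3`. Because the index is derived from the header text, the same call should keep working if the Image column is missing and Name becomes the first column.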

PS: An easy way to collect all image sources:

$imageUrl = []; // collected src attributes of every <img> inside a table cell
$images = $html->find('tr td img');
foreach ($images as $image) {
    $imageUrl[] = $image->src;
}