Need help scraping a web page: getting specific content...
I have a table whose number of columns can change depending on the configuration of the scraped page (I have no control over it). I want to get only the information from a specific column, designated by that column's heading.
Here is a simplified table:
<table>
<tbody>
<tr class='header'>
<td>Image</td>
<td>Name</td>
<td>Time</td>
</tr>
<tr>
<td><img src='someimage.png' /></td>
<td>Name 1</td>
<td>13:02</td>
</tr>
<tr>
<td><img src='someimage.png' /></td>
<td>Name 2</td>
<td>13:43</td>
</tr>
<tr>
<td><img src='someimage.png' /></td>
<td>Name 3</td>
<td>14:53</td>
</tr>
</tbody>
</table>
I want to extract only the names (column 2) from the table. However, as previously stated, the column order cannot be known in advance. The Image column might not be there, for example, in which case the column I want would be the first one.
I was wondering if there's any way to do this with DomDocument/DomXPath. Perhaps search for the string "Name" in the first tr, find out which column index it is, and then use that to get the info. A less elegant solution would be to see if the first column has an img tag, in which case the image column is first, so we can throw that away and use the next one.
I've been looking at it for about an hour and a half, but I'm not familiar with DomDocument functions and manipulation. I'm having a lot of trouble with this one.
Comments (1)
Simple HTML DOM Parser may be useful. You can check the manual. Basically, you should use something like:
This finds which column ($num) is the image column. You can add additional code to improve it.
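For comparison, here is a minimal sketch of the approach the question itself proposes (find "Name" in the header row, then read that column index from every other row). It uses PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM Parser, and it looks up the wanted column by heading rather than detecting the image column. The function name `extractColumnByHeader` and the `$html` sample are assumptions made for illustration, and the sketch assumes the page contains a single table whose header row carries `class='header'` as in the question.

```php
<?php
// Hypothetical helper: returns the text of every cell in the column
// whose header cell matches $heading, or an empty array if not found.
function extractColumnByHeader(string $html, string $heading): array
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parser warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Step 1: find the 1-based position of the wanted heading.
    $index = null;
    $position = 0;
    foreach ($xpath->query("//tr[@class='header']/td") as $cell) {
        $position++;
        if (trim($cell->textContent) === $heading) {
            $index = $position; // XPath positions start at 1
            break;
        }
    }
    if ($index === null) {
        return []; // heading not present on this page
    }

    // Step 2: take the cell at that position from every non-header row.
    $values = [];
    $query = sprintf("//tr[not(@class='header')]/td[%d]", $index);
    foreach ($xpath->query($query) as $cell) {
        $values[] = trim($cell->textContent);
    }
    return $values;
}

// Sample input modelled on the table in the question.
$html = <<<'HTML'
<table>
<tbody>
<tr class='header'><td>Image</td><td>Name</td><td>Time</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 1</td><td>13:02</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 2</td><td>13:43</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 3</td><td>14:53</td></tr>
</tbody>
</table>
HTML;

print_r(extractColumnByHeader($html, 'Name'));
```

Because the index is taken from the header row each time, the same call should keep working if the Image column is absent and Name moves to the first position.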
PS: An easy way to find all image sources:
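One possible way to do this with the built-in DOMDocument class (not Simple HTML DOM Parser) is sketched below. The function name `extractImageSources` and the `$page` sample are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical helper: collects the src attribute of every <img> in the markup.
function extractImageSources(string $html): array
{
    $doc = new DOMDocument();
    // Keep parser warnings about sloppy markup out of the output.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $sources = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        // getAttribute() returns an empty string if src is missing.
        $sources[] = $img->getAttribute('src');
    }
    return $sources;
}

// Small sample with two images.
$page = "<table><tr><td><img src='a.png' /></td></tr>"
      . "<tr><td><img src='b.png' /></td></tr></table>";

print_r(extractImageSources($page));
```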