How do I scrape a specific table with pandas without using its DataFrame index?
I am trying to scrape an HTML table with pandas, and have also tried BeautifulSoup, but I keep running into problems.
Here is the url: https://ciffc.net/en/ciffc/ext/member/sitrep/
Because the page is dynamic and tables get added or removed daily, relying on the table's position in the list returned by pd.read_html is not an option. That said, here is the output I want to pull from the table, using today's table index of 7 as an example:
display(df[7].iloc[1,2])
>> 'Yukon is at a level 3 prep level - but will trend upwards with the forecasted hot and dry weather.'
I don't have this issue with tables that have a caption, since I can use the match parameter of pandas.read_html, but this table has no caption. The data in the table also changes often, and the only distinguishing element I have been able to identify is the "Comments" column. Here is my attempt at identifying this table:
APLtable = pd.read_html(url, match='Comments')[0].head(14)
display(APLtable)
Unfortunately this doesn't work and raises the following error:
ValueError: No tables found matching pattern 'Comments'
I have also tried BeautifulSoup without success. Does anyone know a way to target this specific table given the particularities of the webpage?
Here is the HTML table in question:
</div></div><div id="section-apl" class="section-wrapper" data-title="E: Preparedness Levels"><div id="apl_table_wrapper"><table class="sticky-enabled">
<thead><tr><th class="">Agency</th><th title="Agency Preparedness Level" class=" tooltip">APL</th><th class="">Comments</th> </tr></thead>
<tbody>
<tr id="apl-table-row-0" class="odd"><td>BC</td><td>1</td><td></td> </tr>
<tr id="apl-table-row-1" class="even"><td>YT</td><td>3</td><td>Yukon is at a level 3 prep level - but will trend upwards with the forecasted hot and dry weather.</td> </tr>
<tr id="apl-table-row-2" class="odd"><td>AB</td><td>2</td><td></td> </tr>
<tr id="apl-table-row-3" class="even"><td>SK</td><td>1</td><td></td> </tr>
<tr id="apl-table-row-4" class="odd"><td>MB</td><td>1</td><td></td> </tr>
<tr id="apl-table-row-5" class="even"><td>ON</td><td>1</td><td></td> </tr>
<tr id="apl-table-row-6" class="odd"><td>QC</td><td>1</td><td></td> </tr>
<tr id="apl-table-row-7" class="even"><td>NL</td><td>2</td><td></td> </tr>
<tr id="apl-table-row-8" class="odd"><td>NB</td><td>1</td><td></td> </tr>
<tr id="apl-table-row-9" class="even"><td>NS</td><td>1</td><td></td> </tr>
<tr id="apl-table-row-10" class="odd"><td>PE</td><td>1</td><td></td> </tr>
<tr id="apl-table-row-11" class="even"><td>PC</td><td>1</td><td></td> </tr>
</tbody>
</table>
The tables are, IMHO, actually static and I'd try this:
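One way to do this without a positional index is to anchor on the wrapper div's id from the HTML in the question (`apl_table_wrapper`) rather than on the table's content. The following is only a minimal sketch under that assumption: the helper name `table_by_wrapper_id` is hypothetical, the sample HTML is a trimmed copy of the snippet in the question so the code runs offline, and it assumes the id stays stable from day to day.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question, so the sketch runs offline.
SAMPLE_HTML = """
<div id="section-apl" class="section-wrapper" data-title="E: Preparedness Levels">
<div id="apl_table_wrapper"><table class="sticky-enabled">
<thead><tr><th class="">Agency</th><th title="Agency Preparedness Level" class=" tooltip">APL</th><th class="">Comments</th> </tr></thead>
<tbody>
<tr id="apl-table-row-0" class="odd"><td>BC</td><td>1</td><td></td> </tr>
<tr id="apl-table-row-1" class="even"><td>YT</td><td>3</td><td>Yukon is at a level 3 prep level - but will trend upwards with the forecasted hot and dry weather.</td> </tr>
<tr id="apl-table-row-2" class="odd"><td>AB</td><td>2</td><td></td> </tr>
</tbody>
</table></div></div>
"""


def table_by_wrapper_id(html, wrapper_id="apl_table_wrapper"):
    """Hypothetical helper: return the first table inside the element
    with the given id as a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find(id=wrapper_id)
    table = wrapper.find("table") if wrapper is not None else None
    if table is None:
        raise ValueError(f"no table found inside element with id {wrapper_id!r}")
    # Column names come from the header row, data from the body rows.
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = [
        [td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find("tbody").find_all("tr")
    ]
    return pd.DataFrame(rows, columns=headers)


apl = table_by_wrapper_id(SAMPLE_HTML)
# Look the row up by agency code instead of by position.
yukon_comment = apl.loc[apl["Agency"] == "YT", "Comments"].iloc[0]
print(yukon_comment)
```

For the live page, the same helper could be given the text of `requests.get(url).text`, assuming the server returns this table in the static HTML. Selecting the row by the `Agency` value rather than `iloc[1, 2]` also avoids depending on row order.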
This should consistently output: