How can I scrape a specific table with pandas without using its DataFrame index?

Posted 2025-02-10 20:08:17

I am trying to scrape an HTML table using pandas, and I have also tried BeautifulSoup, but I am running into an issue.

Here is the url: https://ciffc.net/en/ciffc/ext/member/sitrep/

Since the page is dynamic in nature and tables get added or removed daily, using the index of the pd dataframe is not an option. That said, here is the output I am looking to pull from the table using today's table index of 7 as an example.

display(df[7].iloc[1,2])

>> 'Yukon is at a level 3 prep level - but will trend upwards with the forecasted hot and dry weather.'

I don't have this issue scraping tables with a caption as I can use the match parameter of pandas.read_html, but this table doesn't have a caption. The data contained within the table is also very dynamic, with the only unique element I have been able to identify being the "Comments" column. Here is my attempt at identifying this table:

APLtable = pd.read_html(url, match='Comments')[0].head(14)
display(APLtable)

Unfortunately this doesn't work. It raises the following error:

ValueError: No tables found matching pattern 'Comments'

I have also tried BeautifulSoup without success. Does anyone know a way to refer to that specific table, given the particularities of the webpage?

Here is the HTML table in question:

</div></div><div id="section-apl" class="section-wrapper" data-title="E: Preparedness Levels"><div id="apl_table_wrapper"><table class="sticky-enabled">
 <thead><tr><th class="">Agency</th><th title="Agency Preparedness Level" class=" tooltip">APL</th><th class="">Comments</th> </tr></thead>
<tbody>
 <tr id="apl-table-row-0" class="odd"><td>BC</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-1" class="even"><td>YT</td><td>3</td><td>Yukon is at a level 3 prep level - but will trend upwards with the forecasted hot and dry weather.</td> </tr>
 <tr id="apl-table-row-2" class="odd"><td>AB</td><td>2</td><td></td> </tr>
 <tr id="apl-table-row-3" class="even"><td>SK</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-4" class="odd"><td>MB</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-5" class="even"><td>ON</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-6" class="odd"><td>QC</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-7" class="even"><td>NL</td><td>2</td><td></td> </tr>
 <tr id="apl-table-row-8" class="odd"><td>NB</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-9" class="even"><td>NS</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-10" class="odd"><td>PE</td><td>1</td><td></td> </tr>
 <tr id="apl-table-row-11" class="even"><td>PC</td><td>1</td><td></td> </tr>
</tbody>
</table>

隱形的亼 2025-02-17 20:08:17

The tables are, IMHO, actually static and I'd try this:

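from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with the request.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:101.0) Gecko/20100101 Firefox/101.0",
}

url = "https://ciffc.net/en/ciffc/ext/member/sitrep/"
html = requests.get(url, headers=headers).text

# Select the section by its data-title attribute instead of by table index.
section = BeautifulSoup(html, "lxml").find("div", {"data-title": "E: Preparedness Levels"})

# Wrap the markup in StringIO: recent pandas versions deprecate passing a literal HTML string.
df = pd.read_html(StringIO(str(section)), flavor="lxml")[0]
print(df)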

This should consistently output:

   Agency  APL                                           Comments
0      BC    1                                                NaN
1      YT    3  Yukon is at a level 3 prep level - but will tr...
2      AB    2                                                NaN
3      SK    1                                                NaN
4      MB    1                                                NaN
5      ON    1                                                NaN
6      QC    1                                                NaN
7      NL    2                                                NaN
8      NB    1                                                NaN
9      NS    1                                                NaN
10     PE    1                                                NaN
11     PC    1                                                NaN
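
Once the table is isolated, the comment can also be looked up by agency code rather than by row position, so it keeps working if the row order changes. Below is a minimal offline sketch, not tested against the live site: SAMPLE_HTML is a trimmed copy of the markup quoted in the question, and comment_for_agency is a hypothetical helper name introduced only for illustration. It also shows that match='Comments' does find the table when read_html is given this markup directly.

```python
from io import StringIO

import pandas as pd

# Trimmed copy of the table markup quoted in the question (three rows only).
SAMPLE_HTML = """
<div id="section-apl" class="section-wrapper" data-title="E: Preparedness Levels">
<div id="apl_table_wrapper"><table class="sticky-enabled">
<thead><tr><th>Agency</th><th title="Agency Preparedness Level">APL</th><th>Comments</th></tr></thead>
<tbody>
<tr id="apl-table-row-0" class="odd"><td>BC</td><td>1</td><td></td></tr>
<tr id="apl-table-row-1" class="even"><td>YT</td><td>3</td><td>Yukon is at a level 3 prep level - but will trend upwards with the forecasted hot and dry weather.</td></tr>
<tr id="apl-table-row-2" class="odd"><td>AB</td><td>2</td><td></td></tr>
</tbody>
</table></div></div>
"""


def comment_for_agency(df: pd.DataFrame, agency: str):
    """Hypothetical helper: return the Comments cell for one agency code.

    Returns None when the agency is not in the table or its comment is empty.
    """
    rows = df.loc[df["Agency"] == agency, "Comments"]
    if rows.empty or pd.isna(rows.iloc[0]):
        return None
    return rows.iloc[0]


# match= keeps only tables whose text matches the pattern; [0] takes the first hit.
apl = pd.read_html(StringIO(SAMPLE_HTML), match="Comments")[0]

yt_comment = comment_for_agency(apl, "YT")
print(yt_comment)
```

With the live page, the same helper could be applied to the df built above, for example comment_for_agency(df, "YT").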
