How do I scrape data from comment blocks and create a dataframe?

I am trying to pull HTML data from baseball-reference.com. I assumed that if I went to their website and viewed the page source, the tags I need would be in the HTML itself. However, after further investigation, the set of tags I care about sits inside a comment block.

Example: https://www.baseball-reference.com/leagues/AL/2021-standard-batting.shtml
Find the tag by viewing the page source:

<div class="table_container" id="div_players_standard_batting">

The markup I am looking for is below this line. And if you look above this line, you will see that a comment block starts with <!-- and doesn't end until almost the end of the HTML file.
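One quick way to confirm this (a throwaway check, not part of the solution; the index arithmetic is just illustrative) is to look at where the last comment opener before the div falls in the raw HTML:

import requests

html = requests.get("https://www.baseball-reference.com/leagues/AL/2021-standard-batting.shtml").text
div_pos = html.find('id="div_players_standard_batting"')
# the nearest "<!--" before the div is the comment that hides the table
print(html.rfind("<!--", 0, div_pos), div_pos)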

I can pull the HTML comments with the following code, but it comes with a few issues.

  1. The result is a list, and I care only about the element that has the data
  2. It comes with newline characters
  3. I am struggling with how to take the players-standard-batting comment string and re-parse it as HTML so I can use BeautifulSoup to grab the data I want (see the sketch after the code below)

Code:

from bs4 import BeautifulSoup
from bs4 import Comment
import pandas as pd
import os.path
import requests

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, "html.parser") # try lxml

# grab every HTML comment node in the parsed document
Data = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment))]
Data
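For reference, here is a minimal sketch that tackles the three issues above: filter the list down to the comment containing the target id, then re-parse that comment string as HTML so BeautifulSoup can search inside it (the names target, inner, and table are illustrative):

# issue 1: keep only the comment that carries the batting table
target = next(c for c in Data if 'id="div_players_standard_batting"' in c)

# issues 2 and 3: re-parse the comment string as HTML; the parser copes
# with the embedded newlines, so no manual cleanup is needed
inner = BeautifulSoup(str(target), "html.parser")
table = inner.find("table")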

Current Environment Settings:

dependencies:
  - python=3.9.7
  - beautifulsoup4=4.11.1
  - jupyterlab=3.3.2
  - pandas=1.4.2
  - pyodbc=4.0.32

The end goal:
Be able to have a pandas dataframe that has each player's data from this web page.

EDIT:

ANSWER:

Changes made to get to my goal:

  1. Installed the lxml package into my environment via the Anaconda Prompt.
  2. Used the following line of code to pull my HTML data into a DataFrame (provided by HedgeHog - thank you!):

pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0])[0]
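Putting the pieces together, the full flow looks roughly like this (a sketch of the accepted approach laid out as a script; df is an illustrative name):

from bs4 import BeautifulSoup, Comment
import pandas as pd
import requests

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, "lxml")  # lxml parser, installed as noted above

# collect every HTML comment, keep the one wrapping the batting table,
# and hand its markup straight to pandas
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
batting = next(c for c in comments if 'id="div_players_standard_batting"' in c)
df = pd.read_html(str(batting))[0]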

2 Answers

梦言归人 2025-02-09 03:51:25

You are on the right track; you just have to put the individual parts together.

In the ResultSet there should be only one element with the id div_players_standard_batting, so filter for it and pass that element to pandas.read_html() to get a DataFrame:

pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0])[0]
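A side note, not from the original answer: pandas 2.1+ deprecates passing a literal HTML string to pd.read_html(), so if you upgrade past the pandas 1.4.2 pinned above, wrapping the string in io.StringIO keeps the same one-liner working:

from io import StringIO

comment = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment))
           if 'id="div_players_standard_batting"' in x][0]
df = pd.read_html(StringIO(str(comment)))[0]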

or, as an alternative, create a new BeautifulSoup object from it and iterate over the table rows:

soup = BeautifulSoup([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0], "html.parser")
for row in soup.select('table tr'):
    ...
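The loop body is left open above; one way to fill it in (a sketch that rebuilds the same DataFrame by hand, with rows as an illustrative name) is to collect the header and cell texts:

rows = []
for row in soup.select('table tr'):
    # header cells (th) and data cells (td), in document order
    rows.append([cell.get_text(strip=True) for cell in row.select('th, td')])

# first row is the header; the repeated header rows that the site inserts
# still need the clean-up shown in the EDIT below
df = pd.DataFrame(rows[1:], columns=rows[0])

Either way, the resulting frame prints like the output below.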

Output:

[DataFrame output truncated - 1,792 rows with the columns Rk, Name, Age, Tm, Lg, G, PA, AB, R, H, 2B, 3B, HR, RBI, SB, CS, BB, SO, BA, OBP, SLG, OPS, OPS+, TB, GDP, HBP, SH, SF, IBB and Pos Summary; one row per player (Fernando Abad, Cory Abbott, ... Mike Zunino), ending with an "LgAvg per 600 PA" summary row]

EDIT

To get rid of unwanted rows, exclude the rows whose Rk column is NaN (such as the league-average summary) or holds the repeated header value Rk:

df1 = df1[(~df1.Rk.isna()) & (df1.Rk != 'Rk')]
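A possible follow-up, not part of the original answer: because of the repeated header rows, the surviving stat columns are still strings, so they can be coerced back to numbers (the non-numeric column names here are taken from the header above):

# everything except the text columns becomes numeric
num_cols = df1.columns.drop(['Name', 'Tm', 'Lg', 'Pos Summary'])
df1[num_cols] = df1[num_cols].apply(pd.to_numeric, errors='coerce')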
情何以堪。 2025-02-09 03:51:25

First pull the raw HTML, then remove the comment markers with a regex replacement, and then parse the result with beautifulsoup4. I think this will do the trick.
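A sketch of that idea (note that Python's built-in str.replace does not take a regex, so this uses re.sub to strip the delimiters instead; html, div, and df are illustrative names):

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
# drop the comment delimiters so the hidden tables become plain markup
html = re.sub(r"<!--|-->", "", r.text)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="div_players_standard_batting")
df = pd.read_html(str(div))[0]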
