How do I scrape data out of comment blocks and create a dataframe?
I am trying to pull HTML data from baseball-reference.com. I assumed that if I went to their website and viewed the page source, the tags I need would be right there in the HTML. However, after further investigation, the set of tags I care about is inside comment blocks.
Example: https://www.baseball-reference.com/leagues/AL/2021-standard-batting.shtml
Find this tag by viewing the page source:
<div class="table_container" id="div_players_standard_batting">
The code I am looking for is below this line, and if you look above it, you will see that a comment block starts with <!-- and doesn't end until near the end of the HTML file.
I can pull the HTML comments with the following code, but it comes with a few issues:
- The result is a list, and I only care about the one entry that has the data
- It comes with newline tags
- I am struggling with how to take the players standard batting comment string and re-parse it as HTML so I can use BeautifulSoup to grab the data I want (see the sketch after the code block below)
Code:
from bs4 import BeautifulSoup, Comment
import pandas as pd
import requests

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, "html.parser")  # try lxml

# Pull every HTML comment in the page out into a list of strings
data = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment))]
data
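For the last bullet, a comment's text can be handed straight back to BeautifulSoup and parsed like any other HTML. A minimal sketch, assuming data is the list produced by the code above (the div id comes from the accepted answer below):

# Keep only the comment containing the batting table, then re-parse it
batting_comment = next(c for c in data if 'id="div_players_standard_batting"' in c)
batting_soup = BeautifulSoup(batting_comment, "html.parser")
table = batting_soup.find("table")  # the batting table is now an ordinary tag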
Current Environment Settings:
dependencies:
- python=3.9.7
- beautifulsoup4=4.11.1
- jupyterlab=3.3.2
- pandas=1.4.2
- pyodbc=4.0.32
The end goal:
Be able to have a pandas dataframe that has each player's data from this web page.
EDIT:
ANSWER:
Changes made to get to my goal:
Installed the lxml package via Anaconda Prompt into my environment (pandas.read_html uses it as its default parser).
Used the following line of code to pull my HTML data into a DataFrame (provided by HedgeHog - thank you!):
pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0])[0]
2 Answers
You are on the right track; you just have to put the individual parts together.
In the ResultSet there should be only one element that contains id="div_players_standard_batting", so filter for it and pass that element to pandas.read_html() to turn it into a DataFrame:
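A sketch of that step (it is the asker's accepted one-liner expanded for readability, and assumes soup is the BeautifulSoup object from the question's code):

import pandas as pd
from bs4 import Comment

# Keep only the comment whose text contains the batting table's div id,
# then let pandas parse the first <table> found inside that string
comments = [
    x.extract()
    for x in soup.find_all(string=lambda text: isinstance(text, Comment))
    if 'id="div_players_standard_batting"' in x
]
df = pd.read_html(comments[0])[0]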
Or, as an alternative, create a new bs4 object from that comment and iterate over its rows:
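A minimal sketch of that alternative, assuming comments[0] from above and a single header row; the thead/tbody selectors are assumptions based on typical baseball-reference table markup:

from bs4 import BeautifulSoup
import pandas as pd

table_soup = BeautifulSoup(comments[0], "html.parser")

# Header cells from the (assumed single) <thead> row
header = [th.get_text(strip=True) for th in table_soup.select("table thead th")]

# One list of cell texts per body row; each row is a <th> (rank) plus <td> cells
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table_soup.select("table tbody tr")
]
df = pd.DataFrame(rows, columns=header)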
EDIT
To get rid of unwanted rows, exclude the NaN values and the repeated "Rk" header values in column Rk:
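A sketch of that filter, assuming the DataFrame is named df and has a column Rk:

# Spacer rows parse as NaN in Rk; repeated in-table headers parse as the string "Rk"
df = df[df["Rk"].notna() & (df["Rk"] != "Rk")].reset_index(drop=True)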
First pull the raw HTML and then remove the comment markers with str.replace (or a regex). Then parse it with BeautifulSoup. I think this will do the trick.
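A sketch of that approach; plain str.replace is enough for the two comment markers, and the div id is taken from the question:

import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")

# Strip the comment markers so the hidden table becomes ordinary, parseable markup
html = r.text.replace("<!--", "").replace("-->", "")

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="div_players_standard_batting")
df = pd.read_html(str(div))[0]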