How do I scrape data out of comment blocks and create a dataframe?
I am trying to pull HTML data from baseball-reference.com. I assumed that if I went to their website and viewed the page source, the tags I need would be right there in the HTML. However, after further investigation, the set of tags I care about is inside comment blocks.
Example: https://www.baseball-reference.com/leagues/AL/2021-standard-batting.shtml
Find this tag by viewing the page source:
<div class="table_container" id="div_players_standard_batting">
The code I am looking for is below this line, and if you look above it, you will see that a comment block starts with <!-- and doesn't end until near the end of the HTML file.
I can pull the HTML comments with the following code, but it comes with a few issues:
- The result is a list, and I only care about the one entry that has the data
- It comes with newline tags
- I am struggling with how to take the players standard batting comment string and re-parse it as HTML so I can use BeautifulSoup to grab the data I want (see the sketch after the code block below)
Code:
from bs4 import BeautifulSoup, Comment
import pandas as pd
import requests

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")
soup = BeautifulSoup(r.content, "html.parser")  # try lxml

# Pull every HTML comment in the page out into a list of strings
data = [x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment))]
data
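For the last bullet, a comment's text can be handed straight back to BeautifulSoup and parsed like any other HTML. A minimal sketch, assuming data is the list produced by the code above (the div id comes from the accepted answer below):

# Keep only the comment containing the batting table, then re-parse it
batting_comment = next(c for c in data if 'id="div_players_standard_batting"' in c)
batting_soup = BeautifulSoup(batting_comment, "html.parser")
table = batting_soup.find("table")  # the batting table is now an ordinary tag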
Current Environment Settings:
dependencies:
- python=3.9.7
- beautifulsoup4=4.11.1
- jupyterlab=3.3.2
- pandas=1.4.2
- pyodbc=4.0.32
The end goal:
Be able to have a pandas dataframe that has each player's data from this web page.
EDIT:
ANSWER:
Changes made to get to my goal:
Installed the lxml package via Anaconda Prompt into my environment (pandas.read_html uses it as its default parser).
Used the following line of code to pull my HTML data into a DataFrame (provided by HedgeHog - thank you!):
pd.read_html([x.extract() for x in soup.find_all(string=lambda text: isinstance(text, Comment)) if 'id="div_players_standard_batting"' in x][0])[0]
2 Answers
You are on the right track; you just have to put the individual parts together.
In the ResultSet there should be only one element that contains id="div_players_standard_batting", so filter for it and pass that element to pandas.read_html() to turn it into a DataFrame:
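A sketch of that step (it is the asker's accepted one-liner expanded for readability, and assumes soup is the BeautifulSoup object from the question's code):

import pandas as pd
from bs4 import Comment

# Keep only the comment whose text contains the batting table's div id,
# then let pandas parse the first <table> found inside that string
comments = [
    x.extract()
    for x in soup.find_all(string=lambda text: isinstance(text, Comment))
    if 'id="div_players_standard_batting"' in x
]
df = pd.read_html(comments[0])[0]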
Or, as an alternative, create a new bs4 object from that comment and iterate over its rows:
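A minimal sketch of that alternative, assuming comments[0] from above and a single header row; the thead/tbody selectors are assumptions based on typical baseball-reference table markup:

from bs4 import BeautifulSoup
import pandas as pd

table_soup = BeautifulSoup(comments[0], "html.parser")

# Header cells from the (assumed single) <thead> row
header = [th.get_text(strip=True) for th in table_soup.select("table thead th")]

# One list of cell texts per body row; each row is a <th> (rank) plus <td> cells
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table_soup.select("table tbody tr")
]
df = pd.DataFrame(rows, columns=header)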
EDIT
To get rid of unwanted rows, exclude the NaN values and the repeated "Rk" header values in column Rk:
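A sketch of that filter, assuming the DataFrame is named df and has a column Rk:

# Spacer rows parse as NaN in Rk; repeated in-table headers parse as the string "Rk"
df = df[df["Rk"].notna() & (df["Rk"] != "Rk")].reset_index(drop=True)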
First pull the raw HTML and then remove the comment markers with str.replace (or a regex). Then parse it with BeautifulSoup. I think this will do the trick.
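A sketch of that approach; plain str.replace is enough for the two comment markers, and the div id is taken from the question:

import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.get("https://www.baseball-reference.com/leagues/majors/2021-standard-batting.shtml")

# Strip the comment markers so the hidden table becomes ordinary, parseable markup
html = r.text.replace("<!--", "").replace("-->", "")

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", id="div_players_standard_batting")
df = pd.read_html(str(div))[0]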