BeautifulSoup can't find the h3 tags
The URL in this question is: https://www.empireonline.com/movies/features/best-movies-2/
As you can see, h3 tags are present in the page, but BeautifulSoup doesn't print them.
Comments (2)
All of the information is in the returned HTML, inside a `<script>` tag containing JSON data. JavaScript normally converts that data into HTML in the browser, but you can still extract it: use BeautifulSoup to find the tag, then Python's `json` library to convert the data into a Python structure.
For example:
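The following is a minimal sketch of that approach, not the answer's original code. It assumes the JSON lives in a `<script type="application/json">` tag; the helper name `extract_embedded_json`, the `SAMPLE_HTML` string, and its keys are hypothetical stand-ins, so check the real page source for the actual tag attributes and data layout.

```python
import json

from bs4 import BeautifulSoup


def extract_embedded_json(html, script_type="application/json"):
    """Return the parsed JSON from the first <script> tag of the given type, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumption: the data sits in a <script> tag with this type attribute.
    tag = soup.find("script", type=script_type)
    if tag is None or not tag.string:
        return None
    return json.loads(tag.string)


# Hypothetical stand-in for the downloaded page; the real structure will differ.
SAMPLE_HTML = """
<html><body>
<div id="root"></div>
<script type="application/json">{"items": [{"title": "Example Movie A"}, {"title": "Example Movie B"}]}</script>
</body></html>
"""

data = extract_embedded_json(SAMPLE_HTML)
titles = [item["title"] for item in data["items"]]
print(titles)
```

For the real site, you would pass in the downloaded page text instead of `SAMPLE_HTML`, for instance the `.text` of a `requests.get(...)` response.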
The hard part is finding the information you want inside the data structure. I suggest you print `data` and have a closer look. This would give you output starting:
You can't scrape that website statically because part of it is rendered dynamically: some of its content (including the `h3` tags) is available only after the browser executes JavaScript. This is common on sites built with modern web frameworks such as React, which is the case here. To solve this, use a scraping tool that can run a site's scripts, such as `selenium`.
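As a rough sketch of the Selenium route, assuming Selenium 4 with a local Chrome install: the helper names `fetch_rendered_html` and `h3_texts` are hypothetical, and the 10-second timeout is an arbitrary choice. The browser renders the page, and BeautifulSoup then parses the rendered HTML.

```python
from bs4 import BeautifulSoup


def h3_texts(rendered_html):
    """Return the stripped text of every <h3> element in the given HTML."""
    soup = BeautifulSoup(rendered_html, "html.parser")
    return [h3.get_text(strip=True) for h3 in soup.find_all("h3")]


def fetch_rendered_html(url, timeout=10):
    """Load the URL in headless Chrome and return the HTML after scripts have run."""
    # Imported here so h3_texts() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until at least one <h3> exists, i.e. the dynamic content has rendered.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "h3"))
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the wait times out.
        driver.quit()
```

Calling `h3_texts(fetch_rendered_html(url))` with the question's URL would then list the headings, provided the page still renders them as `h3` elements.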