Trying to scrape text from a website with BeautifulSoup4, but nothing happens

Posted 2025-01-11 19:16:01

I want to scrape data from this website: https://playvalorant.com/en-us/news/game-updates/

from bs4 import BeautifulSoup
import requests

site_text = requests.get('https://playvalorant.com/en-us/news/game-updates/').text
soup = BeautifulSoup(site_text, 'lxml')
posts = soup.find_all('li', class_="ContentListing-module--contentListingItem--3GAoa")
for post in posts:
    post_title = post.find(
        'h3', class_="heading-05 bold ContentListingCard-module--title--1vIFy").text
    post_title = post_title.lower()
    if "patch notes" in post_title:
        patch_ver = post_title.replace('valorant patch notes ', '')
        print(f'Patch version: {patch_ver}')
        print("")

But when I run it, nothing happens at all.

What I want to do is to see if the h3 includes the text "patch notes" and if so, check what version it is and go to https://playvalorant.com/en-us/news/game-updates/valorant-patch-notes-(patch-number)-(patch-number)/ (for example, if the text was "VALORANT Patch Notes 3213.07", then I want to go to https://playvalorant.com/en-us/news/game-updates/valorant-patch-notes-3213-07, and so on.)

I'm getting ahead of myself, but the point is, how can I get the text from this website, and then print it out?
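
To make that mapping concrete, here is a minimal sketch of what I mean, assuming the title always has the form "VALORANT Patch Notes X.YY" (the helper name patch_notes_url is just for illustration):

def patch_notes_url(title):
    # lowercase the title and turn spaces and dots into dashes to build the slug
    slug = title.lower().replace(' ', '-').replace('.', '-')
    return f'https://playvalorant.com/en-us/news/game-updates/{slug}/'

print(patch_notes_url('VALORANT Patch Notes 3213.07'))
# -> https://playvalorant.com/en-us/news/game-updates/valorant-patch-notes-3213-07/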

Comments (2)

溺孤伤于心 2025-01-18 19:16:01

The data you see is loaded via JavaScript, so BeautifulSoup doesn't see it. You can use the requests module to fetch the same data directly:

import json
import requests

url = (
    "https://playvalorant.com/page-data/en-us/news/game-updates/page-data.json"
)
data = requests.get(url).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for a in data["result"]["pageContext"]["data"]["articles"]:
    if "Patch Notes" in a["title"]:
        patch_notes_url = "https://playvalorant.com" + a["url"]["url"]
        print("{:<30} {}".format(a["title"], patch_notes_url))

Prints:

VALORANT Patch Notes 4.04      https://playvalorant.com/news/game-updates/valorant-patch-notes-4-04/
VALORANT Patch Notes 4.03      https://playvalorant.com/news/game-updates/valorant-patch-notes-4-03/
VALORANT Patch Notes 4.02      https://playvalorant.com/news/game-updates/valorant-patch-notes-4-02/
VALORANT Patch Notes 4.01      https://playvalorant.com/news/game-updates/valorant-patch-notes-4-01/
VALORANT Patch Notes 4.0       https://playvalorant.com/news/game-updates/valorant-patch-notes-4-0/
VALORANT Patch Notes 3.12      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-12/
VALORANT Patch Notes 3.10      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-10/
VALORANT Patch Notes 3.09      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-09/
VALORANT Patch Notes 3.08      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-08/
VALORANT Patch Notes 3.07      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-07/
VALORANT Patch Notes 3.06      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-06/
VALORANT Patch Notes 3.05      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-05/
VALORANT Patch Notes 3.04      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-04/
VALORANT Patch Notes 3.03      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-03/
VALORANT Patch Notes 3.02      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-02/
VALORANT Patch Notes 3.01      https://playvalorant.com/news/game-updates/valorant-patch-notes-3-01/

...and so on.
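
If you also want just the version number on its own (what the replace() in the question was after), a small extension of the same loop works, assuming the titles keep the "VALORANT Patch Notes X.YY" wording shown above:

import requests

url = "https://playvalorant.com/page-data/en-us/news/game-updates/page-data.json"
data = requests.get(url).json()

for a in data["result"]["pageContext"]["data"]["articles"]:
    title = a["title"]
    if "Patch Notes" in title:
        # strip the common prefix to keep only the version string, e.g. "4.04"
        patch_ver = title.replace("VALORANT Patch Notes ", "")
        patch_notes_url = "https://playvalorant.com" + a["url"]["url"]
        print(f"Patch version: {patch_ver} -> {patch_notes_url}")
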
以为你会在 2025-01-18 19:16:01

Try lxml with XPath to access the required HTML nodes more easily.

from lxml import html
import requests

url = "https://playvalorant.com/en-us/news/game-updates/"

response = requests.get(url, stream=True)
tree = html.fromstring(response.content)

posts = tree.xpath('//section[contains(@class, "section light")]/div/ul')
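
From there you can walk the matched nodes, for example (this assumes the markup shown in the question; note that if the listing is injected by JavaScript, as the other answer points out, these nodes may not be present in the fetched HTML at all):

# iterate the matched <ul> elements and print any <h3> titles underneath them
for ul in posts:
    for title in ul.xpath('.//h3//text()'):
        print(title.strip())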