I need help getting the links from every page

Posted on 2025-02-13 20:30:30


I am trying to get the links from all the pages on https://apexranked.com/. I tried using

url = 'https://apexranked.com/'

page = 1 

while page != 121: 
    url = f'https://apexranked.com/?page={page}'
    print(url) 
    page = page + 1

however, if you click on the page numbers, the URL doesn't change to something like https://apexranked.com/?page=number the way it does on, say, https://www.mlb.com/stats/?page=2. How would I go about accessing and getting the links from all pages if the site doesn't put ?page=number in the URL?


Comments (2)

鼻尖触碰 2025-02-20 20:30:30


The page is not reloading when you click on page 2. Instead, it is firing a GET request to the website's backend.
The request is being sent to : https://apexranked.com/wp-admin/admin-ajax.php
In addition, several parameters are appended directly to the URL:
?action=get_player_data&page=3&total_pages=195&_=1657230896643

Parameters:

  • action: as the endpoint can handle several purposes, you must indicate the action to perform. Almost certainly a mandatory parameter, don't omit it.
  • page: indicates the requested page (i.e. the index you're iterating over).
  • total_pages: indicates the total number of pages (maybe it can be omitted; otherwise you can scrape it from the main page).
  • _: this one corresponds to a Unix timestamp in milliseconds; same idea as above, try omitting it and see what happens. Otherwise you can get a Unix timestamp quite easily with time.time().
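As a side note on the _ parameter: time.time() returns seconds since the epoch as a float, while the value in the captured request is a millisecond integer, so it has to be scaled (a sketch; that the endpoint actually checks the format is an assumption):

```python
import time

# time.time() gives seconds as a float; the captured request
# used a millisecond integer, so scale by 1000 and round.
timestamp_ms = round(time.time() * 1000)
print(timestamp_ms)
```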

Once you get a response, it yields rendered HTML; maybe try setting the Accept: application/json field in the request headers to get JSON instead, but that's just a detail.

All this information wrapped up:

import requests
import time

url = "https://apexranked.com/wp-admin/admin-ajax.php"

# Issued from a previous scraping on the main page
total_pages = 195

params = {
    "total_pages": total_pages,
    "_": round(time.time() * 1000),
    "action": "get_player_data"
}

# Make sure to include all mandatory fields (fill this in)
headers = {
    ...
}

for k in range(1, total_pages + 1):
    params['page'] = k
    res = requests.get(url, headers=headers, params=params)
    # Do your thing :)
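If the endpoint does return rendered HTML as described above, the links can be pulled out of it with the standard library's html.parser. A sketch, assuming the returned rows contain ordinary <a href="..."> tags (the fragment below is a made-up stand-in for res.text):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Small HTML fragment standing in for one page's res.text
fragment = '<tr><td><a href="/player/abc123">abc123</a></td></tr>'
collector = LinkCollector()
collector.feed(fragment)
print(collector.links)  # ['/player/abc123']
```

In practice you would call collector.feed(res.text) on each response inside the loop above and collect the hrefs across all pages.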

还不是爱你 2025-02-20 20:30:30


I don't know exactly what you mean, but if you, for example, want to get the raw text, you can do it with requests:

import requests

page = 1
# A loop that will keep going until the page is not found.
while requests.get(f"https://apexranked.com/?page={page}").status_code != 404:
    # scrape content, e.g. the whole page
    link = f"https://apexranked.com/?page={page}"
    page = page + 1

you can also add each link to a list with nameOfArray.append(link)
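The append idea looks like this (a sketch that only builds the candidate page URLs without fetching them; the 120-page count is taken from the question's own loop):

```python
# Build the candidate page URLs up front; fetching and validating
# them (e.g. the 404 check above) can happen in a separate step.
links = []
for page in range(1, 121):
    links.append(f"https://apexranked.com/?page={page}")

print(len(links))  # 120
print(links[0])    # https://apexranked.com/?page=1
```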
