Extracting p tags from multiple URLs

Posted 2025-02-11 06:03:55


I've struggled on this for days and not sure what the issue could be - basically, I'm trying to extract the profile box data (picture below) of each link -- going through inspector, I thought I could pull the p tags and do so.

(screenshot of the player profile box omitted)

I'm new to this and trying to understand, but here's what I have thus far:

-- a code that (somewhat) successfully pulls the info for ONE link:

import requests
from bs4 import BeautifulSoup


# getting html

url = 'https://basketball.realgm.com/player/Darius-Adams/Summary/28720'
req = requests.get(url)

soup = BeautifulSoup(req.text, 'html.parser')

container = soup.find('div', attrs={'class', 'main-container'})
playerinfo = container.find_all('p')


print(playerinfo)
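For context on what that print shows: find_all returns the whole tags, so the output is a list of raw <p> elements. Calling get_text() on each element gives just the text. A tiny self-contained check of that, using a hard-coded stand-in for the profile box HTML (the real page's keys may differ):

```python
from bs4 import BeautifulSoup

# Hard-coded stand-in for the profile box; real field names come from the page.
html = """
<div class="profile-box">
  <p>Current Team: Example Team</p>
  <p>Born: May 13, 1989</p>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
for p in soup.find_all('p'):
    print(p.get_text(strip=True))   # prints just the text, not the tag
```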

I then also have a code that pulls all of the HREF tags from multiple links:

from bs4 import BeautifulSoup
import requests


def get_links(url):

    links = []

    website = requests.get(url)
    website_text = website.text
    soup = BeautifulSoup(website_text, 'html.parser')

    for link in soup.find_all('a'):

        links.append(link.get('href'))

    for link in links:
        print(link)

        print(len(links))


get_links('https://basketball.realgm.com/dleague/players/2022')
get_links('https://basketball.realgm.com/dleague/players/2021')
get_links('https://basketball.realgm.com/dleague/players/2020')
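One thing I noticed while testing: find_all('a') returns every link on the page, and some <a> tags have no href at all, so the player pages would need to be filtered out before making requests. A small sketch of what I mean, with a made-up sample list (I'm assuming profile URLs all contain '/player/'):

```python
from urllib.parse import urljoin

# Made-up sample of what soup.find_all('a') hrefs might look like.
all_links = [
    '/player/Darius-Adams/Summary/28720',
    '/dleague/players/2021',
    None,                      # some <a> tags have no href attribute
    '/player/Marial-Shayok/Summary/26697',
]

# Keep only hrefs that look like player profile pages, made absolute.
player_links = [urljoin('https://basketball.realgm.com', h)
                for h in all_links if h and '/player/' in h]
print(player_links)
```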

So basically, my goal is to combine these two, and get one code that will pull all of the P tags from multiple URLs. I've been trying to do it, and I'm really not sure at all why this isn't working here:

from bs4 import BeautifulSoup
import requests


def get_profile(url):

    profiles = []

    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    container = soup.find('div', attrs={'class', 'main-container'})

    for profile in container.find_all('a'):

        profiles.append(profile.get('p'))

    for profile in profiles:
        print(profile)


get_profile('https://basketball.realgm.com/player/Darius-Adams/Summary/28720')
get_profile('https://basketball.realgm.com/player/Marial-Shayok/Summary/26697')

Again, I'm really new to web scraping with Python but any advice would be greatly appreciated. Ultimately, my end goal is to have a tool that can scrape this data in a clean way all at once.

(Player name, Current Team, Born, Birthplace, etc).. maybe I'm doing it entirely wrong but any guidance is welcome!

Comments (1)

三人与歌 2025-02-18 06:03:55

You need to combine your two scripts and make a request for each player. Try the following approach. This searches for <td> tags that have the data-th="Player" attribute:

import requests
from bs4 import BeautifulSoup

def get_links(url):
    data = []
    req_url = requests.get(url)
    soup = BeautifulSoup(req_url.content, "html.parser")

    for td in soup.find_all('td', {'data-th' : 'Player'}):
        a_tag = td.a
        name = a_tag.text
        player_url = a_tag['href']
        print(f"Getting {name}")

        req_player_url = requests.get(f"https://basketball.realgm.com{player_url}")
        soup_player = BeautifulSoup(req_player_url.content, "html.parser")
        div_profile_box = soup_player.find("div", class_="profile-box")
        row = {"Name" : name, "URL" : player_url}
        
        for p in div_profile_box.find_all("p"):
            try:
                key, value = p.get_text(strip=True).split(':', 1)
                row[key.strip()] = value.strip()
            except ValueError:     # not all entries contain a "key: value" pair
                pass

        data.append(row)

    return data

urls = [
    'https://basketball.realgm.com/dleague/players/2022',
    'https://basketball.realgm.com/dleague/players/2021',
    'https://basketball.realgm.com/dleague/players/2020',
]


for url in urls:
    print(f"Getting: {url}")
    data = get_links(url)
    
    for entry in data:
        print(entry)
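Since the end goal is clean tabular output, the list of dicts this returns maps straight onto csv.DictWriter. A quick sketch with made-up sample rows (keys vary per player, so collect the union of keys first and fill missing fields with restval):

```python
import csv

# Made-up sample rows shaped like get_links()'s output; real keys come from each profile box.
rows = [
    {"Name": "Darius Adams", "URL": "/player/Darius-Adams/Summary/28720", "Born": "May 13, 1989"},
    {"Name": "Marial Shayok", "URL": "/player/Marial-Shayok/Summary/26697", "Current Team": "Example Team"},
]

# Not every profile lists the same fields, so take the union of keys in first-seen order.
fieldnames = []
for row in rows:
    for key in row:
        if key not in fieldnames:
            fieldnames.append(key)

with open("players.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
```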