How do I scrape hyperlinks from multiple pages of a website using Python/Beautiful Soup?

Posted 2025-01-26 18:54:39


There is a paginated list of hyperlinks on this webpage: https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/.

The code I have created till now scrapes the relevant links from the first page. I cannot figure out how to extract links from subsequent pages (8 links per page, about 25 pages).

There does not seem to be a way to navigate the pages using the URL.
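
If the page list did turn out to be reachable through a query parameter (the ?page= name below is purely an assumption and would need to be confirmed in the browser's network tab, not something taken from the site), a plain loop over that parameter would look roughly like this:

    from bs4 import BeautifulSoup
    import urllib.request
    
    base_url = "https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/"
    parser = 'html.parser'
    
    all_links = []
    for page in range(1, 26):
        # '?page=' is a hypothetical parameter -- verify the real request in the
        # browser's developer tools before relying on it.
        resp = urllib.request.urlopen(f"{base_url}?page={page}")
        soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
        for link in soup.find_all('a', href=True):
            if "/reports/Thurles" in link['href']:
                all_links.append("https://www.farmersforum.ie" + link['href'])
    
    # Drop duplicates while preserving order
    all_links = list(dict.fromkeys(all_links))

If no such parameter exists and the pagination is driven entirely by JavaScript, the extra pages cannot be fetched this way and a browser-automation approach is needed instead.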


    from bs4 import BeautifulSoup
    import urllib.request
    
    # Scrape webpage
    parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
    resp = urllib.request.urlopen("https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/")
    soup = BeautifulSoup(resp, parser, from_encoding=resp.info().get_param('charset'))
    
    # Extract links
    links = []
    for link in soup.find_all('a', href=True):
        links.append(link['href'])
    
    # Select relevant links, reformat, and drop duplicates    
    links = list(dict.fromkeys(["https://www.farmersforum.ie"+link for link in links if "/reports/Thurles" in link]))

Please advise on how I can do this using Python.


Comments (1)

榆西 2025-02-02 18:54:39


I've solved this with Selenium. Thank you.

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from webdriver_manager.chrome import ChromeDriverManager
    import time
    
    # Launch Chrome driver
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    # Open webpage
    driver.get("https://www.farmersforum.ie/mart-reports/county-Tipperary-mart/")
    
    # Loop through pages
    allLnks = []
    iStop = False
    # Continue until a page button can no longer be found
    while not iStop:
        for ii in range(2, 12):
            try:
                # Click page button ii in the pagination bar
                driver.find_element(By.XPATH, '//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[' + str(ii) + ']/a').click()
            except Exception:
                iStop = True
                break
            # Wait for the page to load
            time.sleep(0.1)
            # Identify elements with tag name <a>
            lnks = driver.find_elements(By.TAG_NAME, "a")
            # Traverse the list of links
            iiLnks = []
            for lnk in lnks:
                # Use get_attribute() to read each href and add it to the list
                iiLnks.append(lnk.get_attribute("href"))
            # Select relevant links and drop duplicates (some hrefs are None)
            iiLnks = list(dict.fromkeys([iiLnk for iiLnk in iiLnks if iiLnk and "/reports/Thurles" in iiLnk]))
            allLnks = allLnks + iiLnks
        # Advance to the next block of page buttons
        if not iStop:
            driver.find_element(By.XPATH, '//*[@id="mainContent"]/div/div[1]/div[2]/ul/li[12]/a').click()
    driver.quit()
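
A possible refinement (not something the original answer does): the fixed time.sleep(0.1) can be fragile on a slow connection, because the links may be collected before the new page has rendered. A sketch of an explicit wait with Selenium's WebDriverWait, which blocks until an element from the old page has gone stale, i.e. the content has actually been replaced:

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    # Remember one of the report links currently on the page ...
    old_link = driver.find_element(By.XPATH, '//a[contains(@href, "/reports/Thurles")]')
    # ... then, after clicking the pagination button, wait (up to 10 seconds)
    # until that element goes stale, meaning the old content has been replaced.
    WebDriverWait(driver, 10).until(EC.staleness_of(old_link))

This would take the place of the time.sleep(0.1) call inside the loop, with old_link captured just before the click.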