Unable to scrape a web page

Posted on 2025-02-13 20:43:07


I am having problems trying to scrape a multi-page website with Spyder: the site has pages 1 to 6 plus a Next button, and each of the six pages has 30 results. I've tried two solutions without success.

This is the first one:

#SOLUTION 1#
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1')

#Imports the HTML of the webpage into python      
soup = BeautifulSoup(driver.page_source, 'lxml')

postings = soup.find_all('div', class_ = 'isp_grid_product')

#Creates data frame
df = pd.DataFrame({'Link':[''], 'Vendor':[''],'Title':[''], 'Price':['']})

#Scrape the data
for i in range (1,7): #I've also tried with range (1,6), but it gives 5 pages instead of 6.
    url = "https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num="+str(i)+""
    postings = soup.find_all('li', class_ = 'isp_grid_product')
    for post in postings:
        link = post.find('a', class_ = 'isp_product_image_href').get('href')
        link_full = 'https://store.unionlosangeles.com'+link
        vendor = post.find('div', class_ = 'isp_product_vendor').text.strip()
        title = post.find('div', class_ = 'isp_product_title').text.strip()
        price = post.find('div', class_ = 'isp_product_price_wrapper').text.strip()
        df = df.append({'Link':link_full, 'Vendor':vendor,'Title':title, 'Price':price}, ignore_index = True)

The output of this code is a data frame with 180 rows (30 x 6), but it repeats the results of the first page: my first 30 rows are the 30 results of page 1, rows 31-60 are those same results again, and so on.
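
I suspect the culprit is that the loop builds url on each iteration but never loads it, so soup still holds the page-1 markup parsed before the loop. A minimal sketch of what a fix might look like (untested; it reuses the driver and imports from the snippet above and assumes each page_num URL renders the same JavaScript-driven product grid once the page has loaded):

import time

rows = []
for i in range(1, 7):
    url = "https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=" + str(i)
    driver.get(url)                                    # actually navigate to page i
    time.sleep(3)                                      # crude wait for the JavaScript-rendered grid; WebDriverWait would be cleaner
    soup = BeautifulSoup(driver.page_source, 'lxml')   # re-parse the freshly loaded page
    for post in soup.find_all('li', class_='isp_grid_product'):
        rows.append({
            'Link': 'https://store.unionlosangeles.com' + post.find('a', class_='isp_product_image_href').get('href'),
            'Vendor': post.find('div', class_='isp_product_vendor').text.strip(),
            'Title': post.find('div', class_='isp_product_title').text.strip(),
            'Price': post.find('div', class_='isp_product_price_wrapper').text.strip(),
        })
df = pd.DataFrame(rows)   # building from a list also avoids the deprecated DataFrame.append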

Here is the second solution I tried:

### SOLUTION 2 ###

from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager


driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1')

#Imports the HTML of the webpage into python      
soup = BeautifulSoup(driver.page_source, 'lxml')
soup

#Create data frame
df = pd.DataFrame({'Link':[''], 'Vendor':[''],'Title':[''], 'Price':['']})

#Scrape data
i = 0
while i < 6:
    
    postings = soup.find_all('li', class_ = 'isp_grid_product')
    len(postings)

    for post in postings:
        link = post.find('a', class_ = 'isp_product_image_href').get('href')
        link_full = 'https://store.unionlosangeles.com'+link
        vendor = post.find('div', class_ = 'isp_product_vendor').text.strip()
        title = post.find('div', class_ = 'isp_product_title').text.strip()
        price = post.find('div', class_ = 'isp_product_price_wrapper').text.strip()
        df = df.append({'Link':link_full, 'Vendor':vendor,'Title':title, 'Price':price}, ignore_index = True)

    #Imports the next pages HTML into python
    next_page = 'https://store.unionlosangeles.com'+soup.find('div', class_ = 'page-item next').get('href')
    page = requests.get(next_page)
    soup = BeautifulSoup(page.text, 'lxml')
    i += 1

The problem with this second solution is that the program fails on the "get" call when building next_page, for reasons I cannot grasp (I haven't had this problem on other paginated sites). Thus, I only get the first page and not the others.
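
If it helps, the error is presumably the usual AttributeError: 'NoneType' object has no attribute 'get', which suggests soup.find('div', class_ = 'page-item next') matched nothing in the HTML that requests fetched. A guard like the sketch below, meant to replace the next_page lines at the end of the while loop, at least makes that failure explicit (the 'page-item next' selector and the nested <a> are guesses about the pagination markup, not verified):

    next_div = soup.find('div', class_ = 'page-item next')
    if next_div is None:
        # nothing matched - the element is probably rendered by JavaScript
        # and absent from the raw HTML that requests downloaded
        break
    # the href may sit on the wrapper itself or on an <a> inside it (assumption)
    link_tag = next_div if next_div.has_attr('href') else next_div.find('a')
    if link_tag is None or not link_tag.get('href'):
        break
    next_page = 'https://store.unionlosangeles.com' + link_tag['href']
    page = requests.get(next_page)
    soup = BeautifulSoup(page.text, 'lxml')
    i += 1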

How can I fix the code to properly scrape all 180 elements?

Comments (1)

皇甫轩 2025-02-20 20:43:07


The data you see is loaded from an external URL via JavaScript. You can simulate these calls with the requests module. For example:

import requests
import pandas as pd
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs

url = "https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1"
api_url = "https://cdn-gae-ssl-premium.akamaized.net/categories_navigation"

soup = BeautifulSoup(requests.get(url).content, "html.parser")

params = {
    "page_num": 1,
    "store_id": "",
    "UUID": "",
    "sort_by": "creation_date",
    "facets_required": "0",
    "callback": "",
    "related_search": "1",
    "category_url": "/collections/outerwear",
}

# the <script> tag that follows the search-result container carries store_id and UUID
# in its src query string; pull them out and reuse them for the API calls
q = parse_qs(
    urlparse(soup.select_one("#isp_search_result_page ~ script")["src"]).query
)

params["store_id"] = q["store_id"][0]
params["UUID"] = q["UUID"][0]

all_data = []
for params["page_num"] in range(1, 7):
    data = requests.get(api_url, params=params).json()
    for i in data["items"]:
        # the API returns terse keys: "u" = product URL, "v" = vendor, "l" = title, "p" = price
        link = i["u"]
        vendor = i["v"]
        title = i["l"]
        price = i["p"]

        all_data.append([link, vendor, title, price])

df = pd.DataFrame(all_data, columns=["link", "vendor", "title", "price"])
print(df.head(10).to_markdown(index=False))
print("Total items =", len(df))

Prints:

| link | vendor | title | price |
|------|--------|-------|-------|
| /products/barn-jacket | Essentials | BARN JACKET | 250 |
| /products/work-vest-2 | Essentials | WORK VEST | 120 |
| /products/tailored-track-jacket | Martine Rose | TAILORED TRACK JACKET | 1206 |
| /products/work-vest-1 | Essentials | WORK VEST | 120 |
| /products/60-40-cloth-bug-anorak-1tone | Kapital | 60/40 Cloth BUG Anorak (1Tone) | 747 |
| /products/smooth-jersey-stand-man-woman-track-jkt | Kapital | Smooth Jersey STAND MAN & WOMAN Track JKT | 423 |
| /products/supersized-sports-jacket | Martine Rose | SUPERSIZED SPORTS JACKET | 1695 |
| /products/pullover-vest | Nicholas Daley | PULLOVER VEST | 267 |
| /products/flannel-polkadot-x-bandana-reversible-1st-jkt-1 | Kapital | FLANNEL POLKADOT X BANDANA REVERSIBLE 1ST JKT | 645 |
| /products/60-40-cloth-bug-anorak-1tone-1 | Kapital | 60/40 Cloth BUG Anorak (1Tone) | 747 |
Total items = 175