Unable to scrape a web page with multiple pages
I have problems trying to scrape a web with multiple pages with Spyder: the web has 1 to 6 pages and also a next button. Also, each of one the six pages has 30 results. I've tried two solutions without success.
This is the first one:
#SOLUTION 1#
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1')

#Imports the HTML of the webpage into python
soup = BeautifulSoup(driver.page_source, 'lxml')
postings = soup.find_all('div', class_ = 'isp_grid_product')

#Creates data frame
df = pd.DataFrame({'Link':[''], 'Vendor':[''], 'Title':[''], 'Price':['']})

#Scrape the data
for i in range(1, 7): #I've also tried with range(1, 6), but it gives 5 pages instead of 6.
    url = "https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=" + str(i)
    postings = soup.find_all('li', class_ = 'isp_grid_product')
    for post in postings:
        link = post.find('a', class_ = 'isp_product_image_href').get('href')
        link_full = 'https://store.unionlosangeles.com' + link
        vendor = post.find('div', class_ = 'isp_product_vendor').text.strip()
        title = post.find('div', class_ = 'isp_product_title').text.strip()
        price = post.find('div', class_ = 'isp_product_price_wrapper').text.strip()
        df = df.append({'Link':link_full, 'Vendor':vendor, 'Title':title, 'Price':price}, ignore_index = True)
The output of this code is a data frame with 180 rows (30 x 6), but it just repeats the results of the first page: my first 30 rows are the first page's 30 results, rows 31-60 are the same first-page results again, and so on.
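A likely cause worth noting: inside the `for` loop, `url` is built but never actually loaded, so `soup` still holds the first page's HTML on every iteration. A minimal sketch of the structure the loop needs, with a hypothetical `fetch` callable standing in for the Selenium load-and-parse step:

```python
# Sketch: the pagination loop must re-fetch and re-parse on every iteration.
# `fetch` below is a hypothetical stand-in for:
#     driver.get(url); soup = BeautifulSoup(driver.page_source, 'lxml')
BASE = "https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date"

def page_urls(base, last_page):
    # range(1, last_page + 1) covers pages 1..last_page inclusive,
    # which is why range(1, 7) is needed to reach all 6 pages
    return [f"{base}&page_num={i}" for i in range(1, last_page + 1)]

urls = page_urls(BASE, 6)

def scrape_all(urls, fetch):
    rows = []
    for url in urls:
        page = fetch(url)   # <-- the step missing from Solution 1
        # ...soup.find_all('li', class_='isp_grid_product') etc. goes here...
        rows.append(page)
    return rows
```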
Here is the second solution I tried:
### SOLUTION 2 ###
from selenium import webdriver
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://store.unionlosangeles.com/collections/outerwear?sort_by=creation_date&page_num=1')

#Imports the HTML of the webpage into python
soup = BeautifulSoup(driver.page_source, 'lxml')

#Create data frame
df = pd.DataFrame({'Link':[''], 'Vendor':[''], 'Title':[''], 'Price':['']})

#Scrape data
i = 0
while i < 6:
    postings = soup.find_all('li', class_ = 'isp_grid_product')
    for post in postings:
        link = post.find('a', class_ = 'isp_product_image_href').get('href')
        link_full = 'https://store.unionlosangeles.com' + link
        vendor = post.find('div', class_ = 'isp_product_vendor').text.strip()
        title = post.find('div', class_ = 'isp_product_title').text.strip()
        price = post.find('div', class_ = 'isp_product_price_wrapper').text.strip()
        df = df.append({'Link':link_full, 'Vendor':vendor, 'Title':title, 'Price':price}, ignore_index = True)
    #Imports the next page's HTML into python
    next_page = 'https://store.unionlosangeles.com' + soup.find('div', class_ = 'page-item next').get('href')
    page = requests.get(next_page)
    soup = BeautifulSoup(page.text, 'lxml')
    i += 1
The problem with this second solution is that the program fails on the `.get` call in the `next_page` line, for reasons I cannot grasp (I haven't had this problem on other paginated sites). Thus, I only get the first page and not the others.
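For what it's worth, the usual reason for this error: `soup.find(...)` returns `None` when nothing in the fetched HTML matches, and calling `.get` on `None` raises `AttributeError: 'NoneType' object has no attribute 'get'`. That typically happens when the pagination markup is injected by JavaScript, which `requests` never executes. A minimal reproduction using a tiny HTML snippet rather than the live page:

```python
from bs4 import BeautifulSoup

# HTML as requests might see it: no JS has run, so no 'page-item next' element
html = "<div class='pagination'></div>"
soup = BeautifulSoup(html, "html.parser")

node = soup.find("div", class_="page-item next")
print(node)  # None -> node.get('href') would raise AttributeError

# Defensive pattern: check for None before calling .get
next_href = node.get("href") if node is not None else None
```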
How can I fix the code to properly scrape all 180 elements?
1 Answer
The data you see is loaded from an external URL via JavaScript. You can simulate these calls with the `requests` module.
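The code sample that originally accompanied this answer was not preserved. As a rough sketch of the approach it describes: open the browser's developer tools, find the XHR request that the storefront's search script issues for each page, and replay it with `requests`. The endpoint and parameter names below are placeholders, not the site's real API; the actual URL and query parameters must be copied from the Network tab:

```python
import requests

# Placeholder endpoint -- NOT the site's real API; copy the actual XHR URL
# and its query parameters from the browser's Network tab.
API_URL = "https://search-api.example.invalid/getresults"

def build_params(collection, page, per_page=30):
    # One JSON request per page replaces rendering the page in Selenium
    return {"collection": collection, "page": page, "maxResults": per_page}

def fetch_page(session, page):
    resp = session.get(API_URL, params=build_params("outerwear", page))
    resp.raise_for_status()
    return resp.json()  # the product data arrives as JSON, not HTML

# Build the six per-page parameter sets (pages 1..6)
all_params = [build_params("outerwear", p) for p in range(1, 7)]
```

With the real endpoint filled in, looping `fetch_page(requests.Session(), p)` for pages 1 through 6 would return all 180 products without needing Selenium at all.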