Scraping with pagination to get all product details
I want to scrape all the product data for the 'Cushion cover' category, URL = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
I found that the data is in a script tag, but how do I get the data from all the pages? I need the URLs of all the products from all the pages. The data is also available through an API; for different pages the API is 'https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover.json?limit=50&page=2&sort%5Bby%5D=popularity&sort%5Bdir%5D=desc&catalog=home-and-kitchen&catalog=home-decor&catalog=slipcovers&catalog=cushion-cover'
If we keep changing the page number in the link above, we get the data for the respective page, but how do I collect that data from all the different pages?
Please suggest an approach for this.
import requests
import pandas as pd
import json
import csv
from lxml import html

headers = {
    'authority': 'www.noon.com',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'en-US,en;q=0.9',
}
produrl = 'https://www.noon.com/uae-en/home-and-kitchen/home-decor/slipcovers/cushion-cover/'
prodresp = requests.get(produrl, headers=headers, timeout=30)
print(prodresp)
prodResphtml = html.fromstring(prodresp.text)
# The product data is embedded as JSON in the __NEXT_DATA__ script tag
partjson = prodResphtml.xpath('//script[@id="__NEXT_DATA__"]/text()')
partjson = partjson[0]
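The extracted string is JSON, so it can be parsed with `json.loads`. A minimal sketch of pulling the product URLs out of it follows; the `props -> pageProps -> catalog -> hits` path and the trimmed stand-in payload are assumptions about noon.com's payload shape, not something confirmed above:

```python
import json

# A trimmed stand-in for the __NEXT_DATA__ payload; the real page embeds
# the full catalog here. The 'props -> pageProps -> catalog -> hits' path
# is an assumption about noon.com's JSON shape.
partjson = '''{"props": {"pageProps": {"catalog": {"hits": [
    {"name": "Cushion Cover", "url": "p/cushion-cover-xyz"}
]}}}}'''

data = json.loads(partjson)
hits = data['props']['pageProps']['catalog']['hits']
# Each hit carries a relative URL; prefix it with the store's base path.
product_urls = ['https://www.noon.com/uae-en/' + h['url'] for h in hits]
print(product_urls)
```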
Comments (2)
You are about to reach your goal. You can paginate through the pages using a for loop and the range function; we know the total number of pages is 192, which is why I've built the pagination this robust way. So to get all the product URLs (or any other data item) from all of the pages, you can follow the next example.
Script:
Output:
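The answer's script and output did not survive in this copy. A minimal sketch of the approach it describes, a for loop over `range(1, 193)` against the JSON API URL from the question, might look like the following; the `pageProps -> catalog -> hits` path into the payload is an assumption:

```python
import requests

API = ('https://www.noon.com/_next/data/B60DhzfamQWEpEl9Q8ajE/uae-en/'
       'home-and-kitchen/home-decor/slipcovers/cushion-cover.json')

def page_url(page, limit=50):
    # Rebuild the paginated API URL from the question for a given page number.
    return (f'{API}?limit={limit}&page={page}'
            '&sort%5Bby%5D=popularity&sort%5Bdir%5D=desc'
            '&catalog=home-and-kitchen&catalog=home-decor'
            '&catalog=slipcovers&catalog=cushion-cover')

def scrape_product_urls(total_pages=192):
    headers = {'user-agent': 'Mozilla/5.0'}
    urls = []
    for page in range(1, total_pages + 1):  # pages 1..192
        resp = requests.get(page_url(page), headers=headers, timeout=30)
        resp.raise_for_status()
        # 'pageProps -> catalog -> hits' is an assumed payload shape.
        hits = resp.json()['pageProps']['catalog']['hits']
        urls.extend('https://www.noon.com/uae-en/' + h['url'] for h in hits)
    return urls
```

Each page's hits are appended to one list, so the result is every product URL across the whole category in a single pass.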
I used the re lib. In other words, I used regex; it is much better for scraping pages that render their data with JavaScript.
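As an illustration of that regex approach, the build token in the `/_next/data/<id>/...` API URL can be pulled out of the page HTML; the pattern below assumes Next.js's usual `_buildManifest.js` reference, and the hard-coded snippet stands in for a live response:

```python
import re

# A stand-in snippet of the page HTML; a live noon.com page embeds the same
# /_next/static/<buildId>/_buildManifest.js reference.
page_html = '<script src="/_next/static/B60DhzfamQWEpEl9Q8ajE/_buildManifest.js"></script>'

match = re.search(r'/_next/static/([^/"]+)/_buildManifest\.js', page_html)
build_id = match.group(1)
print(build_id)  # the token to splice into the /_next/data/<id>/... URL
```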