I have successfully used Python to scrape store information from a competitor's website. The website has a map where you can enter a zip code, and it lists all the stores in the area around that location. The website sends a GET request to pull the store data using this link:
https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=50&pagesize=30
My goal is to scrape all the store information, not just the results for a single zip code (for example, the imaginary 12345) with pagesize=30.
How should I go about getting all the store information? Would it be better to iterate through a dataset of zip codes to pull all the stores, or is there a better way to do this? I've tried increasing pagesize past 30, but that looks like the limit on the request.
Comments (2)
This URL returns JSON containing "currentPage":1, which suggests it supports some kind of pagination. I added &page=2 and it seems to work.
Page 1:
https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=250&pagesize=40&page=1
Page 2:
https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=250&pagesize=40&page=2
Page 3:
https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=250&pagesize=40&page=3
For testing I used a larger radius=250 to get JSON with "recordCount":123. I found that it also works with pagesize=40. For larger values it sends JSON with an error message.
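The numbers above imply how many pages to request: 123 records at 40 per page need 4 pages. A minimal sketch of that arithmetic and of building the page URLs follows. It assumes that recordCount is the total across all pages, which the response seems to suggest but which is worth checking. The helper names page_count and page_urls are made up for illustration.

```python
import math
from urllib.parse import urlencode

BASE_URL = "https://www.homedepot.com/StoreSearchServices/v2/storesearch"


def page_count(record_count, pagesize):
    """Number of pages needed to cover record_count results."""
    return math.ceil(record_count / pagesize)


def page_urls(address, record_count, radius=250, pagesize=40):
    """Build one URL per page, with pages numbered from 1."""
    urls = []
    for page in range(1, page_count(record_count, pagesize) + 1):
        query = urlencode(
            {"address": address, "radius": radius, "pagesize": pagesize, "page": page}
        )
        urls.append(f"{BASE_URL}?{query}")
    return urls


if __name__ == "__main__":
    # recordCount 123 is the value reported for radius=250 in the test above.
    for url in page_urls("37028", record_count=123):
        print(url)
```

The first three URLs printed match the Page 1 to Page 3 links listed earlier. The fourth covers the remaining 3 records.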
EDIT:
Minimal working code:
The page blocks requests that have no User-Agent header.
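A minimal sketch of such a request loop, written with only the standard library, which may not be the HTTP client the original code used. Several things here are assumptions to verify against a real response: that the list of stores sits under a "stores" key, that a simple browser-like User-Agent value is enough, and that recordCount is the total across pages. The names fetch_page and fetch_all_stores are illustrative.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://www.homedepot.com/StoreSearchServices/v2/storesearch"
# Any browser-like value; this exact string is only an example.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def fetch_page(address, page, radius=250, pagesize=40):
    """Request one page of results and return the decoded JSON."""
    query = urlencode(
        {"address": address, "radius": radius, "pagesize": pagesize, "page": page}
    )
    request = urllib.request.Request(f"{BASE_URL}?{query}", headers=HEADERS)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def fetch_all_stores(address, fetch=fetch_page, max_pages=50):
    """Collect stores page by page until a page is empty or all are read."""
    all_stores = []
    for page in range(1, max_pages + 1):
        data = fetch(address, page)
        stores = data.get("stores") or []  # "stores" is an assumed key name
        if not stores:
            break
        all_stores.extend(stores)
        total = data.get("recordCount")
        if total is not None and len(all_stores) >= total:
            break
    return all_stores


if __name__ == "__main__":
    stores = fetch_all_stores("37028")
    print(len(stores))
```

Passing the fetch function as a parameter keeps the paging logic testable without hitting the site, and max_pages guards against looping forever if the response shape differs from what is assumed.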
If you want to keep the results as a DataFrame, then maybe first put all the items in a list and later convert this list to a DataFrame.
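A small sketch of that idea, assuming pandas is installed. The records below are hypothetical stand-ins shaped loosely like the JSON items. Real responses have more fields and different values, and storeId and name are assumed field names.

```python
import pandas as pd

# Hypothetical items from two pages of results.
page_1 = [
    {"storeId": "0001", "name": "Store A", "address": {"postCode": "00001"}},
    {"storeId": "0002", "name": "Store B", "address": {"postCode": "00002"}},
]
page_2 = [
    {"storeId": "0003", "name": "Store C", "address": {"postCode": "00003"}},
]

# Collect everything in a plain list first.
all_stores = []
for page_items in (page_1, page_2):
    all_stores.extend(page_items)

# Convert once at the end instead of appending to a DataFrame per page.
df = pd.DataFrame(all_stores)
print(df)
```

Building the DataFrame once from the full list avoids repeatedly copying data, which is what happens when rows are appended to a DataFrame inside a loop.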
Because the JSON keeps address as a dictionary {'postCode': ... , ...}, some columns may contain dictionaries. See the { } in address, services, storeHours, etc. You may also need to convert each of these into separate columns and concat them with the original df. You can do the same with the other columns.
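One possible way to do this expansion is sketched below, assuming pandas is installed. The sample rows are hypothetical, and expand_dict_column is an illustrative helper name, not something from the original answer.

```python
import pandas as pd


def expand_dict_column(df, column):
    """Replace a column of dictionaries with one column per dictionary key."""
    # Each dictionary becomes a row of the new frame; keys become columns.
    expanded = df[column].apply(pd.Series).add_prefix(f"{column}_")
    # Drop the original column and attach the expanded ones side by side.
    return pd.concat([df.drop(columns=[column]), expanded], axis=1)


if __name__ == "__main__":
    # Hypothetical rows; real items carry more fields.
    df = pd.DataFrame(
        [
            {"storeId": "0001", "address": {"postCode": "00001", "city": "Town A"}},
            {"storeId": "0002", "address": {"postCode": "00002", "city": "Town B"}},
        ]
    )
    print(expand_dict_column(df, "address"))
```

The prefix keeps the new column names unambiguous when several dictionary columns share key names. If the raw list of items is still available, pd.json_normalize can flatten nested dictionaries in one step instead.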
I had the same issue before, and you have already named one of the solutions.
I also recommend checking domain/sitemap.xml and domain/robots.txt for a list of the available stores.
Also, the data is sometimes loaded through .js requests, so open the browser's network tab and search for one of the stores' IDs.
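A rough sketch of the sitemap check, using only the standard library. It relies on two conventions: robots.txt files can list sitemaps on "Sitemap:" lines, and sitemap files put each page URL in a loc element. Whether this particular site lists its stores that way is not known. The example.com URLs and the "/store/" keyword are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in the text of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def locs_from_sitemap(xml_text, keyword=""):
    """Return the <loc> URLs in a sitemap that contain the keyword."""
    root = ET.fromstring(xml_text)
    locs = [el.text.strip() for el in root.iterfind(".//sm:loc", SITEMAP_NS) if el.text]
    return [url for url in locs if keyword in url]


if __name__ == "__main__":
    robots = "User-agent: *\nSitemap: https://example.com/sitemap.xml\n"
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/store/1</loc></url>"
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    )
    print(sitemaps_from_robots(robots))
    print(locs_from_sitemap(sitemap, keyword="/store/"))
```

Both functions take text that has already been downloaded, so they can be tried on saved copies of the files before writing any request code.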