Scraping more base data from a website's map of store locations?

Posted on 2025-02-04 18:45:46


Currently, I have successfully used Python to scrape data from a competitor's website to find store information. The website has a map where you can enter a zip code and it will list all the stores in the area around that location. The website pulls store data by sending a GET request to this link:

https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=50&pagesize=30

My goal is to scrape all store information, not just the results for one example zip code (e.g. 12345) with pagesize=30.
How should I go about getting all the store information? Would it be better to iterate through a dataset of zip codes to pull all the stores, or is there a better way to do this? I've tried raising the page size past 30, but that appears to be the limit on the request.
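For reference, one possible shape of the zip-code iteration mentioned above, as a rough untested sketch (it assumes the requests library and the storeId field shown in the answers below; the zip codes are just examples):

import requests

url = 'https://www.homedepot.com/StoreSearchServices/v2/storesearch'
headers = {'User-Agent': 'Mozilla/5.0'}

zip_codes = ['37028', '37040', '42001']   # example list; a real run would load a full zip-code dataset

seen = {}   # deduplicate stores returned by overlapping search radii

for zip_code in zip_codes:
    params = {'address': zip_code, 'radius': 50, 'pagesize': 30}
    data = requests.get(url, params=params, headers=headers).json()
    for store in data.get('stores', []):
        seen[store['storeId']] = store   # keyed by storeId so each store is kept once

print(len(seen), 'unique stores')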


Comments (2)

离鸿 2025-02-11 18:45:46


This URL returns JSON containing "currentPage": 1, which suggests it supports some kind of pagination.

I added &page=2 and it seems to work.

Page 1:

https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=250&pagesize=40&page=1

Page 2:

https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=250&pagesize=40&page=2

Page 3:

https://www.homedepot.com/StoreSearchServices/v2/storesearch?address=37028&radius=250&pagesize=40&page=3

For testing I used a larger radius=250 to get JSON with "recordCount": 123.

I found that it also works with pagesize=40.
For larger values it returns JSON with an error message.
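With those numbers the page count could also be computed up front instead of looping until a response has no stores. A small sketch under that assumption, using the recordCount and storesPerPage values shown in the output below:

import math

record_count = 123   # "recordCount" from the first response
page_size = 40       # "storesPerPage"

pages = math.ceil(record_count / page_size)   # 123 / 40 -> 4 pages

print(pages)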


EDIT:

Minimal working code:

The page blocks requests that don't send a User-Agent header, so one is included:

import requests

# the endpoint rejects requests without a browser-like User-Agent
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0',
}

url = 'https://www.homedepot.com/StoreSearchServices/v2/storesearch'

# query parameters sent with every request; 'page' is updated inside the loop
payload = {
    'address': 37028,
    'radius': 250,
    'pagesize': 40,
    'page': 1,
}

page = 0

while True:

    page += 1
    print('--- page:', page, '---')
    
    payload['page'] = page
    response = requests.get(url, params=payload, headers=headers)
    
    data = response.json()

    # searchReport holds recordCount / currentPage / storesPerPage;
    # .get() avoids a KeyError if the final response omits it
    print(data.get('searchReport'))

    # stop when the response no longer contains a page of stores
    if "stores" not in data:
        break
    
    for number, item in enumerate(data['stores'], 1):
        print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')

Result:

--- page: 1 ---
{'recordCount': 123, 'currentPage': 1, 'storesPerPage': 40}
 1 | phone: (931)906-2655 | zip: 37040
 2 | phone: (270)442-0817 | zip: 42001
 3 | phone: (615)662-7600 | zip: 37221
 4 | phone: (615)865-9600 | zip: 37115
 5 | phone: (615)228-3317 | zip: 37216
 6 | phone: (615)269-7800 | zip: 37204
 7 | phone: (615)824-2391 | zip: 37075
 8 | phone: (615)370-0730 | zip: 37027
 9 | phone: (615)889-7211 | zip: 37076
10 | phone: (615)599-4578 | zip: 37064

etc. 

--- page: 2 ---
{'recordCount': 123, 'currentPage': 2, 'storesPerPage': 40}
 1 | phone: (662)890-9470 | zip: 38654
 2 | phone: (502)964-1845 | zip: 40219
 3 | phone: (812)941-9641 | zip: 47150
 4 | phone: (812)282-0470 | zip: 47129
 5 | phone: (662)349-6080 | zip: 38637
 6 | phone: (502)899-3706 | zip: 40207
 7 | phone: (662)840-8390 | zip: 38866
 8 | phone: (502)491-3682 | zip: 40220
 9 | phone: (870)268-0619 | zip: 72404
10 | phone: (256)575-2100 | zip: 35768

etc.

If you want to keep the data as a DataFrame, it may be easiest to first collect all items in a list and later convert that list to a DataFrame:

# --- before loop ----

all_items = []

page = 0

# --- loop ----

while True:

    # ... code ...
    
    for number, item in enumerate(data['stores'], 1):
        print(f'{number:2} | phone: {item["phone"]} | zip: {item["address"]["postalCode"]}')
        all_items.append(item)

# --- after loop ----

import pandas as pd

df = pd.DataFrame(all_items)

print(df)

Because the JSON keeps address as a dictionary {'postalCode': ..., ...}, some columns hold dictionaries:

print(df.iloc[0])
storeId                                                             0726
name                                                     Clarksville, TN
phone                                                      (931)906-2655
address                {'postalCode': '37040', 'county': 'Montgomery'...
coordinates                        {'lat': 36.581677, 'lng': -87.300826}
services               {'loadNGo': True, 'propane': True, 'toolRental...
storeContacts                 [{'name': 'Brenda G.', 'role': 'Manager'}]
storeHours             {'monday': {'open': '6:00', 'close': '21:00'},...
url                           /l/Clarksville-TN/TN/Clarksville/37040/726
distance                                                       32.530296
proDeskPhone                                               (931)920-9400
flags                  {'bopisFlag': True, 'assemblyFlag': True, 'bos...
marketNbr                                                           0019
axGeoCode                                                             00
storeTimeZone                                                    CST6CDT
curbsidePickupHours    {'monday': {'open': '09:00', 'close': '18:00'}...
storeOpenDt                                                   1998-08-13
storeType                                                         retail
toolRentalPhone                                                      NaN

See the { } in address, services, storeHours, etc.

You may also need to expand these into separate columns.

df['address'].apply(pd.Series)

and concatenate the result with the original df:

df2 = pd.concat( [df, df['address'].apply(pd.Series)], axis=1 )

You can do the same with the other dictionary columns.
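As a possible alternative (an untested sketch, not part of the original answer), pandas.json_normalize can flatten the nested dictionaries in one step when building the DataFrame from the collected items:

import pandas as pd

# flattens nested dicts such as 'address' and 'coordinates' into columns
# like 'address.postalCode' and 'coordinates.lat';
# list-valued fields such as storeContacts are left as-is
df = pd.json_normalize(all_items)

print(df.columns)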

咋地 2025-02-11 18:45:46


I had the same issue before, and you have already stated one of the solutions.

I recommend checking domain/sitemap.xml and domain/robots.txt for the available stores.

Also, sometimes the data is loaded by .js/XHR requests, so open the network tab and search for one of the store IDs.
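A rough sketch of the sitemap idea, assuming the sitemap lives at /sitemap.xml and that store pages use the /l/ path seen in the url field of the previous answer; the exact sitemap layout is an assumption and may differ (large sites often split it into a sitemap index):

import re
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

# assumed location of the sitemap; robots.txt usually lists the real one
sitemap_url = 'https://www.homedepot.com/sitemap.xml'
xml_text = requests.get(sitemap_url, headers=headers).text

# pull every <loc> entry and keep the ones that look like store pages (/l/...)
all_urls = re.findall(r'<loc>(.*?)</loc>', xml_text)
store_urls = [u for u in all_urls if '/l/' in u]

print(len(store_urls), 'store-like URLs found')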
