How can I parse all 400 franchises on this site if pages 2, 3, 4 all have the same URL?

Posted on 2025-01-17 18:29:26

I am scraping the website https://www.franchisetimes.com/top-400-2021/ and need to collect the data inside each franchise's page. I'm still building the skeleton (I haven't gotten to the actual scraping yet), but I can't parse anything beyond franchise #25, and I don't know how to advance to the next pages.

Thanks in advance for your comments and suggestions.

So I'm stuck here:

from bs4 import BeautifulSoup as bs
import requests

DOMAIN = 'https://www.franchisetimes.com'
URL = 'https://www.franchisetimes.com/top-400-2021/'
FILETYPE = '.html'

def get_soup(url):
    # Download the page and parse it with the built-in HTML parser.
    return bs(requests.get(url).text, 'html.parser')

#get_soup(DOMAIN)

i = 0
for link in get_soup(URL).find_all('a'):
    file_link = link.get('href')
    try:
        if "top-400-2021" in file_link and "block_id" not in file_link and FILETYPE in file_link:
            i += 1
            print(file_link)
            print(i)
    except TypeError:
        # link.get('href') returns None for <a> tags without an href.
        print("nonetype")

Comments (1)

夕嗳→ 2025-01-24 18:29:26

It uses JavaScript to load JSON data from

https://www.franchisetimes.com/search/?bl=1111254&o=0&l=25&f=json&altf=widget

(I found it using DevTools in Firefox/Chrome; tab: Network, filter: XHR.)

If you use o=25 instead of o=0, you get the JSON data for the second page; with o=50 you get the third page, and so on.
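
The relationship between page number and offset can be sketched as follows. This is only an illustration, assuming the page size stays at 25 (the l parameter) and that the bl value seen in DevTools keeps identifying this list; page_url is a hypothetical helper name, not something provided by the site or by requests.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.franchisetimes.com/search/'
PAGE_SIZE = 25  # value of the 'l' parameter seen in DevTools


def page_url(page, page_size=PAGE_SIZE):
    """Build the JSON URL for a 1-based page number (hypothetical helper)."""
    params = {
        'bl': '1111254',              # block id copied from the XHR request
        'o': (page - 1) * page_size,  # offset of the first item on the page
        'l': page_size,
        'f': 'json',
        'altf': 'widget',
    }
    return BASE_URL + '?' + urlencode(params)


print(page_url(1))
print(page_url(3))
```

Page 1 maps to o=0, page 2 to o=25, page 3 to o=50, which matches the pattern described above.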

import requests

# Query parameters copied from the XHR request seen in DevTools.
payload = {
    'bl': '1111254',
    'o': 0,       # offset of the first item
    'l': 25,      # number of items per page
    'f': 'json',
    'altf': 'widget',
}

url = 'https://www.franchisetimes.com/search/'

# 400 items at 25 per page: offsets 0, 25, ..., 375.
for offset in range(0, 400, 25):
    print('\n--- offset:', offset, '---\n')

    payload['o'] = offset
    response = requests.get(url, params=payload)
    data = response.json()
    for item in data['assets']:
        print(item['title'])

Result:

--- offset: 0 ---

1. McDonald’s
2. 7-Eleven
3. KFC
4. Ace Hardware
5. Burger King
6. Domino's
7. Circle K
8. Chick-fil-A
9. Subway
10. Pizza Hut
11. Taco Bell
12. RE/MAX
13. Wendy’s
14. Keller Williams Realty
15. Dunkin’
16. Marriott Hotels & Resorts
17. Sonic Drive-In
18. Tim Hortons
19. Popeyes Louisiana Kitchen
20. Panera Bread
21. Dairy Queen
22. Little Caesars
23. Hampton by Hilton
24. Holiday Inn Express
25. Arby’s

--- offset: 25 ---

26. Papa John’s
27. Hyatt
28. Jack In The Box
29. Courtyard
30. Berkshire Hathaway HomeServices
31. Chili's
32. Hilton Hotels & Resorts
33. Buffalo Wild Wings
34. Applebee’s
35. Express Employment Professionals
36. The UPS Store
37. SERVPRO
38. Paris Baguette
39. Whataburger
40. Holiday Inn Hotels & Resorts
41. Outback Steakhouse
42. Residence Inn
43. H&R Block
44. Comfort Inn & Suites
45. Planet Fitness
46. Five Guys
47. IHOP
48. Home Instead Senior Care
49. Aaron’s
50. Baskin Robbins

--- offset: 50 ---

51. Renaissance
52. Zaxby’s
53. Hardee’s
54. G.J. Gardner Homes
55. Culver’s Butterburgers & Frozen Custard
56. Wingstop
57. Jimmy John’s
58. DoubleTree by Hilton
59. Denny’s
60. Jiffy Lube
61. Quality Inn & Suites
62. Snap-on Tools
63. Jersey Mike’s Subs
64. HomeVestors
65. Carl’s Jr.
66. Midas
67. Roto-Rooter
68. Anytime Fitness
69. Valvoline Instant Oil Change
70. ampm
71. Bojangles’ Famous Chicken 'n Biscuits
72. InterContinental Hotels & Resorts
73. Church’s Chicken
74. Crowne Plaza Hotels & Resorts
75. La Quinta Inn & Suites

--- offset: 75 ---

76. Pet Supplies Plus
77. Super 8
78. CARSTAR
79. Days Inn
80. Orangetheory Fitness
81. Interim HealthCare
82. Red Robin
83. Great Clips
84. Massage Envy
85. Big O Tires
86. Paul Davis Restoration
87. Home2 Suites by Hilton
88. Color Glo International
89. El Pollo Loco
90. Window World
91. Firehouse Subs
92. Checkers/Rally’s
93. American Family Care
94. Del Taco
95. Boston Pizza
96. Qdoba Mexican Eats
97. Linc Service
98. Papa Murphy's
99. Marco’s Pizza
100. Ramada

etc.
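
The loop above hard-codes 400 items. A possible variation, sketched here under the assumption that the endpoint returns an empty 'assets' list once the offset runs past the end (I have not verified this), is to keep requesting pages until one comes back empty. The names collect_assets and fetch_from_site are hypothetical, chosen only for this sketch.

```python
def collect_assets(fetch_page, page_size=25, max_items=1000):
    """Gather items page by page until a page comes back empty.

    fetch_page is any callable that takes an offset and returns the
    decoded JSON dict (an interface assumed for this sketch).
    """
    items = []
    for offset in range(0, max_items, page_size):
        assets = fetch_page(offset).get('assets', [])
        if not assets:  # assumed end-of-list signal
            break
        items.extend(assets)
    return items


def fetch_from_site(offset):
    """Fetch one page from the endpoint used in the answer."""
    import requests  # imported here so collect_assets can be tried without it

    payload = {'bl': '1111254', 'o': offset, 'l': 25, 'f': 'json', 'altf': 'widget'}
    response = requests.get('https://www.franchisetimes.com/search/', params=payload)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()


# Usage: all_items = collect_assets(fetch_from_site)
```

Passing the fetch function as an argument keeps the paging logic separate from the network call, and max_items caps the loop in case the assumption about the empty page does not hold.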


If you display data['assets'][0].keys(), you can see which other fields each item contains:

['title', 'uuid', 'published', 'type', 'url', 'canonical', 'byline', 'starttime', 'updated', 'last_updated', 'pretty_date', 'kicker', 'hammer', 'keywords', 'flags', 'comment_count', 'time_to_consume', 'preview', 'new_window', 'is_premium', 'icon', 'rank', 'name', 'sales', 'loc', 'sections', 'summary', 'authors']

For example:

item["rank"]    "1"
item["name"]    "McDonald’s"
item["sales"]   "$93,317,000,000"
item["loc"]     "39,198"
# image with logo
item["preview"]["url"]  "https://bloximages.newyork1.vip.townnews.com/franchisetimes.com/content/tncms/assets/v3/editorial/8/ab/8ab10429-92c1-56e8-91fc-0ecf5f5504cd/5f7e8e875707b.image.jpg"