Extract all URLs from a webpage, including hidden ones
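I want to extract all URLs from a webpage, including the ones that appear behind a 'hidden' button (see image).
Also: would it be possible to include multiple pages in a search (see image)?
<img src="https://i.sstatic.net/i1bwy.png" alt="Second question example">
Thank you!
I managed to extract the URLs from the page, but this does not include the hidden ones.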
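from urllib.request import Request, urlopen

import pandas
from bs4 import BeautifulSoup

req = Request('https://www.sainsburys.co.uk/gol-ui/SearchResults/vegan')  # example
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
df = pandas.DataFrame()
links = []
# Collect the href of every anchor present in the static HTML
for link in soup.find_all('a'):
    links.append(link.get('href'))
set_df = set(links)  # drop duplicates
df['Urls'] = list(set_df)
df = df.sort_values("Urls")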
Comments (1)
Use the API, then iterate through the pages with the payload parameter:
Output:
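As a rough illustration of the approach described in the answer (not the answer's original code), the sketch below pages through a JSON search API by changing a page parameter in the request payload. The endpoint `API_URL`, the payload keys (`keyword`, `page_number`, `page_size`) and the response fields (`products`, `url`) are all hypothetical placeholders; the real ones would have to be read from the browser's network tab while the search page loads.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint: replace with the API URL seen in the browser's network tab.
API_URL = "https://example.com/api/search"


def fetch_page(keyword, page_number, page_size=60):
    """Request one page of results from the (assumed) JSON API."""
    # Assumed payload keys; the real API may name them differently.
    payload = {"keyword": keyword, "page_number": page_number, "page_size": page_size}
    req = Request(f"{API_URL}?{urlencode(payload)}", headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req) as resp:
        return json.load(resp)


def collect_urls(keyword, fetch=fetch_page, max_pages=100):
    """Walk the pages until one comes back empty, gathering unique URLs."""
    urls = set()
    for page_number in range(1, max_pages + 1):
        data = fetch(keyword, page_number)
        products = data.get("products", [])  # assumed response shape
        if not products:
            break  # no more results
        urls.update(p["url"] for p in products if p.get("url"))
    return sorted(urls)
```

With a working `fetch_page`, the result could be loaded into the DataFrame from the question with something like `pandas.DataFrame({"Urls": collect_urls("vegan")})`. Passing the fetch function as an argument keeps the paging loop testable without network access.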