Python Selenium scraper: pagination to the next page shows an error. Scrape protection on the website?
I'm running a Python Selenium script in a Lambda function on AWS.
I'm scraping this page: Link
The scraper itself is working fine.
But the pagination to the next page stopped working. It worked before for many months.
I exported a screenshot via:
png = driver.get_screenshot_as_base64()
It shows this page instead of the second page:
I run this code (simplified version):
while url:
    driver.get(url)
    # Dump a base64-encoded screenshot of the loaded page for debugging
    png = driver.get_screenshot_as_base64()
    print(png)
    button_next = driver.find_elements_by_class_name("PaginationArrowLink-sc-imp866-0")
    try:
        # Take the last matching pagination arrow and follow its link
        url = button_next[-1].get_attribute("href")
        print("button_next_url: " + str(url))
    except IndexError:
        # No pagination arrow on the page: stop the loop
        url = ""
        print('Error in URL')
The interesting thing is that the printed URL is totally fine, and when I open it manually in the browser it loads page 2:
https://www.stepstone.de/5/ergebnisliste.html?what=Berufskraftfahrer&searchorigin=Resultlist_top-search&suid=1faad076-5348-48d8-9834-4e0d9a836e34&of=25&action=paging_next
But driver.get(url) leads to the error page shown in the screenshot.
Is this some sort of scrape protection on the website? Or is there another reason it stopped working from one day to the next?
Comments (1)
The solution was to cut the last part of the URL.
from:
to:
I still don't understand why Selenium was not able to load it when opening it manually works, but the scraper is running again now.
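The answer does not show the URLs before and after the change, so the exact part that was cut is unknown. As a rough sketch only, assuming the part to drop is a trailing query parameter such as action=paging_next (a guess based on the URL in the question, not something the answer confirms), the URL could be cleaned with the standard library before passing it to driver.get(). The helper name drop_query_params is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_params(url, names):
    """Return url with the query parameters listed in names removed."""
    parts = urlsplit(url)
    # Keep every (key, value) pair whose key is not in the drop list
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# URL taken from the question; removing "action" is an assumption
url = (
    "https://www.stepstone.de/5/ergebnisliste.html"
    "?what=Berufskraftfahrer&searchorigin=Resultlist_top-search"
    "&suid=1faad076-5348-48d8-9834-4e0d9a836e34&of=25&action=paging_next"
)
cleaned = drop_query_params(url, {"action"})
print(cleaned)
```

In the loop from the question, the cleaned value would replace the raw href, for example url = drop_query_params(url, {"action"}) right after reading the attribute. Parsing the query string rather than slicing the string keeps this working if the parameter order changes.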