Python Selenium Scraper: Pagination to next page shows error. Scrape protection on the website?

Posted on 2025-02-08 23:34:22

I'm running a Python Selenium script in a Lambda function on AWS.

I'm scraping this page: Link

The scraper itself works fine, but pagination to the next page has stopped working. It had worked for many months before.

I exported a screenshot via:

png = driver.get_screenshot_as_base64()

It shows an error page instead of the second page.
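For inspecting such a screenshot outside the logs, the base64 string can be decoded back into a PNG file. The following is only a minimal sketch: the helper name save_screenshot and the default path /tmp/page.png are assumptions for illustration (on Lambda, /tmp is the writable directory), not part of the original script.

```python
import base64
from pathlib import Path


def save_screenshot(png_base64, path="/tmp/page.png"):
    """Decode a base64-encoded PNG string and write it to `path`.

    Returns the number of bytes written. The helper name and the
    default path are hypothetical, chosen for illustration only.
    """
    data = base64.b64decode(png_base64)
    Path(path).write_bytes(data)
    return len(data)


# Intended use inside the scraper (requires a live `driver`):
# save_screenshot(driver.get_screenshot_as_base64())
```

Writing the image to a file keeps the function logs readable compared with printing the whole base64 string.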

I run this code (simplified version):

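from selenium.webdriver.common.by import By

# `driver` and the initial `url` are set up earlier (omitted in this simplified version)
while url:
    driver.get(url)
    png = driver.get_screenshot_as_base64()  # screenshot for debugging
    print(png)
    # find_elements_by_class_name was removed in Selenium 4; find_elements(By.CLASS_NAME, ...) is the replacement
    button_next = driver.find_elements(By.CLASS_NAME, "PaginationArrowLink-sc-imp866-0")
    try:
        # the last matching link is the "next page" arrow
        url = button_next[-1].get_attribute("href")
        print("button_next_url: " + str(url))
    except IndexError:
        # no pagination link found: end the loop
        url = ""
        print('Error in URL')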

Interestingly, the printed URL is fine: when I open it manually in a browser, it loads page 2:

https://www.stepstone.de/5/ergebnisliste.html?what=Berufskraftfahrer&searchorigin=Resultlist_top-search&suid=1faad076-5348-48d8-9834-4e0d9a836e34&of=25&action=paging_next

But driver.get(url) leads to the error page shown in the screenshot.

Is this some sort of scraping protection on the website? Or is there another reason it stopped working from one day to the next?

吲‖鸣 2025-02-15 23:34:22

The solution was to cut off the last part of the URL (the action=paging_next parameter).

From:

https://www.stepstone.de/5/ergebnisliste.html?what=berufskraftfahrer&searchorigin=Resultlist_top-search&of=25&action=paging_next

To:

https://www.stepstone.de/5/ergebnisliste.html?what=berufskraftfahrer&searchorigin=Resultlist_top-search&of=25

I still don't understand why Selenium could not load the original URL when opening it manually works, but the scraper is running again now.
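One way to cut that parameter programmatically, rather than slicing the string by hand, is to parse the query string and rebuild the URL without it. This is only a sketch using Python's standard urllib.parse; the helper name drop_query_params is hypothetical, and the assumption that action is the only parameter that needs removing comes from the two URLs above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def drop_query_params(url, names=("action",)):
    """Return `url` with the query parameters listed in `names` removed.

    Hypothetical helper for illustration; the remaining parameters
    keep their original order.
    """
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


next_url = (
    "https://www.stepstone.de/5/ergebnisliste.html"
    "?what=berufskraftfahrer&searchorigin=Resultlist_top-search"
    "&of=25&action=paging_next"
)
print(drop_query_params(next_url))
```

In the scraping loop, the href read from the pagination link would be passed through this helper before the next driver.get(url) call.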
