Being a good citizen and web scraping

Posted 2024-12-21 19:05:36

I have a two-part question.

First, I'm writing a web scraper based on the CrawlSpider spider in Scrapy. I'm aiming to scrape a website that has many thousands (possibly into the hundreds of thousands) of records. These records are buried 2-3 layers down from the start page. So basically I have the spider start on a certain page, crawl until it finds a specific type of record, and then parse the HTML. What I'm wondering is what methods exist to prevent my spider from overloading the site? Is there possibly a way to do things incrementally, or put a pause in between different requests?
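(For concreteness, a minimal sketch of the kind of CrawlSpider I mean, written against a recent Scrapy release; the domain, URL patterns, and selectors below are placeholders rather than the real site.)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class RecordSpider(CrawlSpider):
    name = 'records'
    allowed_domains = ['example.com']            # placeholder domain
    start_urls = ['http://example.com/catalog']  # placeholder start page

    rules = (
        # follow the 2-3 levels of listing pages without parsing them
        Rule(LinkExtractor(allow=r'/category/')),
        # pages that look like individual records get handed to parse_record
        Rule(LinkExtractor(allow=r'/record/\d+'), callback='parse_record'),
    )

    def parse_record(self, response):
        # parse the HTML of a single record page
        yield {
            'title': response.css('h1::text').get(),
            'url': response.url,
        }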

Second, and related, is there a method with Scrapy to test a crawler without placing undue stress on a site? I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?

Any advice or resources would be greatly appreciated.

2 Answers

堇年纸鸢 2024-12-28 19:05:36

Is there possibly a way to do things incrementally

I'm using Scrapy's caching ability to scrape the site incrementally:

HTTPCACHE_ENABLED = True

Or you can use the new 0.14 feature, Jobs: pausing and resuming crawls.
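As a rough sketch, both options might look like this (the cache directory, spider name, and job directory are placeholders):

# settings.py -- enable the HTTP cache so repeated runs reuse already-downloaded pages
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'        # cache location, relative to the project directory
HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached responses never expire

# command line -- persist crawl state so a run can be paused (Ctrl-C) and resumed later
scrapy crawl myspider -s JOBDIR=crawls/myspider-1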

or put a pause in between different requests?

Check these settings:

DOWNLOAD_DELAY    
RANDOMIZE_DOWNLOAD_DELAY
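For example, in settings.py (the 2-second value is only an illustration):

# settings.py -- throttle the crawl instead of firing requests back-to-back
DOWNLOAD_DELAY = 2                 # wait roughly 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True    # vary the actual delay between 0.5x and 1.5x DOWNLOAD_DELAY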

is there a method with Scrapy to test a crawler without placing undue stress on a site?

You can try and debug your code in the Scrapy shell.
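The shell fetches only the single page you give it, so you can try out selectors without crawling anything else (the URL is a placeholder):

# command line -- download one page and drop into an interactive shell
scrapy shell "http://example.com/record/123"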

I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?

Also, you can call scrapy.shell.inspect_response at any time in your spider.
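A rough sketch of that, written against the current API where inspect_response takes the response and the spider (older versions may differ slightly; the URL is a placeholder):

import scrapy
from scrapy.shell import inspect_response

class DebugSpider(scrapy.Spider):
    name = 'debug_one_page'
    start_urls = ['http://example.com/record/123']  # placeholder: the first page you care about

    def parse(self, response):
        # Drop into an interactive shell on this response; no further requests
        # are scheduled, so the site only ever sees this single page fetch.
        inspect_response(response, self)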

Any advice or resources would be greatly appreciated.

The Scrapy documentation is the best resource.

倾其所爱 2024-12-28 19:05:36

You have to start crawling and log everything. If you get banned, you can add a sleep() before page requests.

Changing the User-Agent is good practice, too (http://www.user-agents.org/, http://www.useragentstring.com/).

If you get banned by IP, use a proxy to bypass it. Cheers.
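A minimal sketch of both ideas, assuming the stock Scrapy settings and the built-in HttpProxyMiddleware (the User-Agent string and proxy address are placeholders):

# settings.py -- identify the crawler with an explicit User-Agent
USER_AGENT = 'my-research-bot/0.1 (+http://example.com/bot-info)'

# inside a spider callback, a single request can be routed through a proxy:
#     yield scrapy.Request(url, meta={'proxy': 'http://127.0.0.1:8080'})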
