How should I schedule many Google scrapes over the course of a day?
Currently, my Nokogiri script iterates through Google's SERPs until it finds the position of the target website. It does this for each keyword for each website that each user specifies (users are capped on the number of websites & keywords they can track).
Right now, it runs in a rake task that's hard-scheduled once a day and batches all the scrapes at once by looping through all the websites in the database. But I'm concerned about scalability and about swarming Google with a burst of requests.
I'd like a solution that scales and can run these scrapes over the course of the day. I'm not sure what kind of solution is available or what I'm really looking for.
Note: The number of websites/keywords changes from day to day as users add and delete their websites and keywords. I don't mean to make this question too open-ended, but is this the kind of thing Beanstalkd/Stalker (job queueing) can be used for?
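For reference, the current task is roughly this shape (a simplified sketch; Website, Keyword, and find_position stand in for my actual code):

# Simplified sketch of the current hard-scheduled daily rake task.
# Website/Keyword models and find_position are illustrative names.
namespace :scrapes do
  task :daily => :environment do
    Website.find_each do |website|
      website.keywords.each do |keyword|
        # Walks Google's SERPs with Nokogiri until the site is found.
        position = find_position(website.url, keyword.term)
        keyword.update_attribute(:position, position)
      end
    end
  end
end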
4 Answers
You will have to balance two issues: scalability for lots of users versus Google shutting you down for scraping in violation of their terms of use.
So your system will need to be able to distribute tasks to various different IPs to conceal your bulk scraping, which suggests at least two levels of queuing: one to manage all the jobs and send them out to each separate IP for subsequent searching and collecting of results, and a queue on each separate machine to hold the requested searches until they are executed and the results returned.
I have no idea what Google's thresholds are (I am sure they don't advertise them), but exceeding them and getting cut off would obviously be devastating for what you are trying to do, so your simple looping rake task is exactly what you shouldn't do after a certain number of users.
So yes, use a queue of some sort, but realize that your goal probably differs from the typical goal of a queue: you want to deliberately delay jobs rather than offload work to avoid UI delays. So you will be seeking ways to slow down the queue rather than have it execute job after job as each arrives.
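Since you mentioned Beanstalkd/Stalker: Beanstalkd supports delayed puts natively, so you can space jobs out as you enqueue them. A rough sketch, assuming Stalker's enqueue(job, args, opts) accepts a :delay option that it passes through to Beanstalkd (the job name and spacing are illustrative):

# Sketch: spread scrape jobs across the day using Beanstalkd's
# native delay, via Stalker. 'scrape.serp' is a made-up job name.
require 'stalker'

pairs = []  # e.g. [[website_id, keyword], ...] loaded from the DB
spacing = (24 * 60 * 60) / [pairs.size, 1].max  # seconds between jobs

pairs.each_with_index do |(website_id, keyword), i|
  Stalker.enqueue('scrape.serp',
                  { 'website_id' => website_id, 'keyword' => keyword },
                  :delay => i * spacing)
end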
So based on a cursory inspection of DelayedJob and BackgroundJobs, it looks like DelayedJob has what you would need with the run_at attribute. But I am only speculating here and I am sure an expert would have more to say.
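To make that concrete, here is a rough sketch of using run_at to spread jobs out, assuming the classic Delayed::Job.enqueue(payload, priority, run_at) signature from tobi's version (newer forks take keyword arguments instead; ScrapeJob and its fields are illustrative):

# Sketch: enqueue one job per website/keyword pair, spaced evenly
# across 24 hours via run_at.
ScrapeJob = Struct.new(:website_id, :keyword) do
  def perform
    # Nokogiri SERP-walking logic goes here.
  end
end

pairs = []  # [[website_id, keyword], ...] from the DB
spacing = (24 * 60 * 60) / [pairs.size, 1].max

pairs.each_with_index do |(website_id, keyword), i|
  # enqueue(payload_object, priority, run_at) in the tobi-era API
  Delayed::Job.enqueue(ScrapeJob.new(website_id, keyword), 0, Time.now + i * spacing)
end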
If I'm understanding correctly, it sounds like one of these tools might fit the bill:
Delayed_job: https://github.com/tobi/delayed_job
or
BackgroundJobs: http://codeforpeople.rubyforge.org/svn/bj/trunk/README
I've used both of them, and found them easy to work with.
There are definitely some background job libraries that might work.
However, you might think about just scheduling a Cron job that runs more times during the day and processes fewer items per run.
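For example (a sketch; the schedule, task name, and Scrape model are all illustrative): run the task hourly and let each run claim roughly 1/24 of the pending work, oldest scrapes first:

# Hypothetical cron entry:
#   0 * * * * cd /path/to/app && RAILS_ENV=production bundle exec rake scrapes:hourly
# Assumes a Scrape model with a scraped_at column and a run! method
# wrapping the Nokogiri work; names are illustrative.
namespace :scrapes do
  task :hourly => :environment do
    batch_size = (Scrape.count / 24.0).ceil
    Scrape.order(:scraped_at).limit(batch_size).each do |scrape|
      scrape.run!
      scrape.update_attribute(:scraped_at, Time.now)
      sleep 5  # crude politeness delay between requests
    end
  end
end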
SaaS solution: http://momentapp.com/ ("Launch delayed jobs with scheduled http requests"). Disclaimer: (a) it is in beta, and (b) I am not affiliated with this service.