How can I recrawl different websites on different schedules in Nutch 1.3?

Posted on 2024-12-11 00:48:54

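I have many websites; the content of some changes monthly, and the content of others changes daily. Nutch 1.3 has already crawled them once, and now I want to recrawl them on different schedules. How can I do that? Thanks.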

如果没有你 2024-12-18 00:48:54

You can write a shell script that contains the command used to run the crawler, and then schedule that script with cron on Linux.

http://www.thegeekstuff.com/2011/07/cron-every-5-minutes/

Even Google recrawls the whole web again after some interval.
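To make the cron suggestion concrete, here is a minimal sketch of such a script. It assumes one crawl directory per schedule (the paths /data/crawl-daily and /data/crawl-monthly, the script name recrawl.sh, and the NUTCH variable are all hypothetical) and runs one generate/fetch/parse/updatedb round using the standard bin/nutch commands. Treat it as a starting point rather than a tested production setup.

```shell
#!/bin/sh
# recrawl.sh: one recrawl round for a single crawl directory.
# NUTCH is an assumption; point it at bin/nutch in your own installation.
NUTCH="${NUTCH:-bin/nutch}"

recrawl_once() {
    crawl_dir="$1"
    # generate only selects URLs that are due, and writes them to a new segment.
    $NUTCH generate "$crawl_dir/crawldb" "$crawl_dir/segments" || return 1
    # Segment directories are named by timestamp, so the newest one sorts last.
    segment=$(ls -d "$crawl_dir"/segments/* | sort | tail -n 1)
    $NUTCH fetch "$segment" || return 1
    $NUTCH parse "$segment" || return 1
    # Merge the fetch results back into the crawldb.
    $NUTCH updatedb "$crawl_dir/crawldb" "$segment" || return 1
}

# Run a round only when a crawl directory is passed on the command line.
if [ "$#" -gt 0 ]; then
    recrawl_once "$1"
fi

# Example crontab entries (hypothetical paths):
#   0 2 * * * /opt/nutch/recrawl.sh /data/crawl-daily     # every day at 02:00
#   0 3 1 * * /opt/nutch/recrawl.sh /data/crawl-monthly   # 1st of each month at 03:00
```

Keeping the daily and monthly sites in separate crawl directories is just one way to split the schedules; the per-URL fetch interval described in the other answer avoids the need for two crawldbs.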

耳根太软 2024-12-18 00:48:54

You can specify a fetch interval (the time, in seconds, between two consecutive fetches) for each entry in your seed file, like this:

http://daily.com \t nutch.fetchInterval=86400
http://montly.com \t nutch.fetchInterval=2592000

Here \t stands for a tab character; 86400 seconds is one day and 2592000 seconds is 30 days.

If you are using AdaptiveFetchSchedule, the entries above only set the starting interval; after each recrawl the interval is increased or decreased depending on whether the page has changed. If you always want a fixed interval in that case, use nutch.fetchInterval.fixed instead of nutch.fetchInterval in the lines above.
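Since the separator has to be a real tab and the intervals are in seconds, it can be easier to generate the seed lines than to type them by hand. The sketch below does that with printf; the helper name seed_line and the urls/seed.txt and crawl/crawldb paths are assumptions for illustration, not anything Nutch requires.

```shell
#!/bin/sh
# Print one seed line: URL, a real tab character, then the interval metadata.
seed_line() {
    # $1 = URL, $2 = fetch interval in seconds
    printf '%s\tnutch.fetchInterval=%s\n' "$1" "$2"
}

DAY=86400              # 24 * 60 * 60 seconds
MONTH=$((30 * DAY))    # 2592000 seconds, treating a month as 30 days

# Emit the two example entries; redirect stdout into your seed file.
seed_line "http://daily.com" "$DAY"
seed_line "http://montly.com" "$MONTH"
```

You could run it as `sh make_seeds.sh > urls/seed.txt` (the script name is hypothetical) and then inject the directory with `bin/nutch inject crawl/crawldb urls`.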
