Fast re-crawl of websites
I am developing a system that has to track the content of a few portals and check for changes every night (for example, download and index new pages that were added during the day). The content of these portals will be indexed for searching. The problem is re-crawling these portals - the first crawl of a portal takes very long (examples of portals: www.onet.pl, www.bankier.pl, www.gazeta.pl) and I want to re-crawl them as fast as possible, for example by checking the date of modification. However, when I used wget to download www.bankier.pl, it complained that there is no Last-Modified header.
Is there any way to re-crawl so many sites? I have also tried using Nutch, but its re-crawl script does not seem to work properly - or perhaps it also depends on that header (Last-Modified).
Maybe there is a tool or crawler (like Nutch or something else) that can update the already downloaded sites by adding the new ones?
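(For reference, the modification-date check described above maps to wget's `-N`/`--timestamping` option, which compares the server's Last-Modified header against the local file's timestamp and warns with "Last-modified header missing" when the server does not send one. A minimal sketch of such a run, assuming a shallow recursive mirror:)

```
# Recursive fetch, two levels deep; -N re-downloads a page only when the
# server's Last-Modified header is newer than the local copy (and warns
# when the header is missing, as happens with www.bankier.pl).
wget -N -r -l 2 http://www.bankier.pl/
```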
Best regards,
Wojtek
2 Answers
I recommend using curl to fetch only the headers (a HEAD request) and check whether the Last-Modified header has changed.
Example:
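A minimal sketch along those lines, using the asker's www.bankier.pl as the URL; `-I` issues a HEAD request, `-s` silences the progress meter, and `last-modified.txt` is a hypothetical file for remembering the previous value:

```
# Ask for the response headers only and keep the Last-Modified line, if any.
new=$(curl -sI http://www.bankier.pl/ | grep -i '^Last-Modified')

# Compare with the value stored on the previous run; re-crawl on change.
old=$(cat last-modified.txt 2>/dev/null)
if [ "$new" != "$old" ]; then
    echo "$new" > last-modified.txt
    echo "Page changed - schedule a re-crawl"
fi
```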
For Nutch, I have written a blog post on how to re-crawl with Nutch. Basically, you should set a low value for the db.fetch.interval.default setting. On the next fetch of a URL, Nutch will use the last fetch time as the value of the If-Modified-Since HTTP header.
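A sketch of that setting, assuming the usual override in Nutch's conf/nutch-site.xml (the value is in seconds; Nutch's default is 30 days, far too long for a nightly re-crawl):

```
<!-- conf/nutch-site.xml -->
<property>
  <name>db.fetch.interval.default</name>
  <!-- Consider a page due for re-fetch after 1 day (86400 s) instead of
       the 30-day default, so nightly crawl cycles pick URLs up again. -->
  <value>86400</value>
</property>
```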