How do I make Nutch crawl a file system?
Not over HTTP (e.g. http://localhost:81 and the like), but crawling a certain directory on the local file system directly. Is there any way to do this?
2 Answers
From the Nutch Wiki:
How do I index my local file system?
http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
Change this line:
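As a sketch, assuming the stock conf/crawl-urlfilter.txt that ships with Nutch 0.x/1.x (the exact wording may differ between versions), the rule to change is the one that skips file: URLs:

    # skip file:, ftp:, & mailto: urls
    -^(file|ftp|mailto):

The edit is to skip http: instead, so file: URLs are allowed and the crawl cannot wander off your disk onto the web:

    # skip http:, ftp:, & mailto: urls
    -^(http|ftp|mailto):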
2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
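Again as an assumption based on the stock file, the catch-all rule at the bottom usually reads:

    # accept anything else
    +.*

If the file instead ends with a reject-everything rule such as "-.", the file: URLs for your directory need a matching accept (+) rule above it.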
3) I changed my nutch.xml to include the following:
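A plausible sketch of such a change for Nutch 1.x, treating the exact values as assumptions: enable the protocol-file plugin in plugin.includes and lift the per-file size limit via file.content.limit (both property names come from the stock nutch-default.xml; the usual place to override them is conf/nutch-site.xml):

    <property>
      <!-- include protocol-file so file: URLs can be fetched from disk -->
      <name>plugin.includes</name>
      <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>
    <property>
      <!-- -1 removes the default 65536-byte truncation of fetched files -->
      <name>file.content.limit</name>
      <value>-1</value>
    </property>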
Nutch has intranet crawling available. You can read the details here.
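For context, a rough sketch of how the intranet-style crawl is usually driven in Nutch 0.x/1.x (the paths, depth and topN below are placeholders): put a file: URL for the directory to index into a seed file, for example a urls/seed.txt containing

    file:///home/user/docs/

and then run the one-step crawl tool from the old Nutch tutorial:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Here -dir names the output directory and -depth/-topN bound how far the crawler follows links from the seed.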