How do I make Nutch crawl a file system?
Not over HTTP (e.g. http://localhost:81 and the like), but crawling a certain directory on the local file system directly. Is there any way to do this?
2 Answers
From the Nutch Wiki:
How do I index my local file system?
http://wiki.apache.org/nutch/FAQ#head-c721b23b43b15885f5ea7d8da62c1c40a37878e6
1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
Change this line:
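As a sketch, assuming the stock conf/crawl-urlfilter.txt that ships with Nutch 0.x/1.x (the exact wording may differ between versions), the rule to change is the one that skips file: URLs:

    # skip file:, ftp:, & mailto: urls
    -^(file|ftp|mailto):

The edit is to skip http: instead, so file: URLs are allowed and the crawl cannot wander off your disk onto the web:

    # skip http:, ftp:, & mailto: urls
    -^(http|ftp|mailto):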
2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
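Again as an assumption based on the stock file, the catch-all rule at the bottom usually reads:

    # accept anything else
    +.*

If the file instead ends with a reject-everything rule such as "-.", the file: URLs for your directory need a matching accept (+) rule above it.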
3) I changed my nutch.xml to include the following:
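A plausible sketch of such a change for Nutch 1.x, treating the exact values as assumptions: enable the protocol-file plugin in plugin.includes and lift the per-file size limit via file.content.limit (both property names come from the stock nutch-default.xml; the usual place to override them is conf/nutch-site.xml):

    <property>
      <!-- include protocol-file so file: URLs can be fetched from disk -->
      <name>plugin.includes</name>
      <value>protocol-file|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>
    <property>
      <!-- -1 removes the default 65536-byte truncation of fetched files -->
      <name>file.content.limit</name>
      <value>-1</value>
    </property>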
Nutch has intranet crawling available. You can read the details here.
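For context, a rough sketch of how the intranet-style crawl is usually driven in Nutch 0.x/1.x (the paths, depth and topN below are placeholders): put a file: URL for the directory to index into a seed file, for example a urls/seed.txt containing

    file:///home/user/docs/

and then run the one-step crawl tool from the old Nutch tutorial:

    bin/nutch crawl urls -dir crawl -depth 3 -topN 50

Here -dir names the output directory and -depth/-topN bound how far the crawler follows links from the seed.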