Nutch web spider: indexing the entire web

Posted 2024-10-22 08:34:30

Alright, I've been experimenting with Nutch and need to know which parameter in the crawl-urlfilter.txt file to edit so that the spider has no boundaries. In other words, I want it to roam the web outside of a specified domain.

I'm assuming it has to do with this line, but I don't know how to edit it correctly to get the behavior I want:

+^http://([a-z0-9]*\.)*urlz.net/

顾北清歌寒 2024-10-29 08:34:30

I'm not familiar with Nutch, but this is just a regular expression.

+^http://([a-z0-9\.])*

would probably work just fine, or some variation thereof. It's just matching a pattern. The one I wrote above should match anything starting with http:// followed by any number of lowercase letters, digits, or dots.
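To see the difference between the two patterns, here is a rough Python sketch rather than Nutch itself. It assumes each rule line is a `+` (include) or `-` (exclude) sign followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is rejected. It also assumes Python's `re.search` behaves like Nutch's Java regex matching for these simple patterns. The helper name `first_match_filter` and the sample URLs are made up for illustration.

```python
import re


def first_match_filter(rules, url):
    """Return True if `url` is accepted by the rule list.

    Assumed semantics: each rule is a sign ('+' or '-') followed by a regex;
    the first rule whose regex is found in the URL decides the outcome,
    and a URL that matches no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False


# The rule from the question: only hosts under urlz.net pass.
DOMAIN_ONLY = [r"+^http://([a-z0-9]*\.)*urlz.net/"]

# The relaxed rule from the answer: any URL starting with http:// passes.
ANY_HTTP = [r"+^http://([a-z0-9\.])*"]

if __name__ == "__main__":
    # Hypothetical sample URLs, chosen only to exercise both rule sets.
    samples = [
        "http://www.urlz.net/page",
        "http://example.com/",
        "https://example.com/",
    ]
    for url in samples:
        print(url,
              first_match_filter(DOMAIN_ONLY, url),
              first_match_filter(ANY_HTTP, url))
```

Note that both patterns start with `^http://`, so under these assumptions an `https://` URL is rejected by either rule. If HTTPS pages should be crawled too, a variation such as `+^https?://` might be needed.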
