Nutch web spider: indexing the whole web
Alright, I've been messing around with Nutch and need to know which setting in the crawl-urlfilter.txt
file I should edit so the spider has no boundaries. In other words, I want it to roam the web beyond a specified domain.
I'm assuming it has to do with this line, but I don't know how to edit it correctly to get the behavior I want:
+^http://([a-z0-9]*\.)*urlz.net/
Comments (1)
I'm not familiar with Nutch, but this is just a regular expression.
Would probably work just fine, or some variation thereof. It's just matching a pattern. The one I just wrote above should match anything starting with http:// followed by any number of letters, numbers, or dots.
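
To make that description concrete, here is a minimal Python sketch of how such a filter line might be evaluated. It rests on several assumptions: that the leading `+` marks an accept rule rather than being part of the regex, that the first matching rule decides, and that a URL matching no rule is rejected. The helpers `parse_rule` and `accepts` are hypothetical, and `RELAXED` is only one pattern that fits the description (http:// followed by letters, digits, or dots), not necessarily the exact one intended.

```python
import re

def parse_rule(line):
    """Split a filter line into (accept?, compiled regex).

    Assumption: the first character is '+' (accept) or '-' (reject)
    and everything after it is the regular expression.
    """
    return line[0] == "+", re.compile(line[1:])

def accepts(rules, url):
    """Return the decision of the first rule whose regex is found in the URL.

    Assumption: a URL that matches no rule is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# The line from the question: only hosts ending in urlz.net pass.
RESTRICTED = [parse_rule(r"+^http://([a-z0-9]*\.)*urlz.net/")]

# Hypothetical relaxed line: same prefix with the domain part dropped.
RELAXED = [parse_rule(r"+^http://([a-z0-9]*\.)*")]

if __name__ == "__main__":
    for url in ("http://www.urlz.net/page", "http://example.com/"):
        print(url, accepts(RESTRICTED, url), accepts(RELAXED, url))
```

With the domain part removed, any URL beginning with http:// passes the rule, which is what lets the crawl leave the original site. Note that this particular pattern still rejects https:// URLs.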