How do I use Nutch to index only pages with certain URLs?
I want Nutch to crawl abc.com, but I want to index only car.abc.com. Links to car.abc.com can appear at any level of abc.com. So, basically, I want Nutch to keep crawling abc.com normally, but index only pages whose URLs start with car.abc.com, e.g. car.abc.com/toyota...car.abc.com/honda...
I set regex-urlfilter.txt to include only car.abc.com and ran the command "generate crawl/crawldb crawl/segments", but it just says "Generator: 0 records selected for fetching, exiting ...". I guess the car.abc.com links only exist several levels deep.
How can I do this?
Thanks.
Comments (1)
One way is to use the -filter switch of the mergedb command. The command takes a crawl db as input and creates a new crawl db from which the URLs rejected by the URL filters have been removed. Just use that filtered crawl db for indexing.
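To get a feel for which crawl db entries would survive the filtering step, here is a rough Python approximation of how I understand regex-urlfilter.txt rules to be evaluated: rules are tried top to bottom, the first pattern found in the URL decides, `+` accepts, `-` rejects, and a URL matching no rule is rejected. This is only a sketch for experimenting with rules, not Nutch code; Nutch uses Java regular expressions, and the rule text in `CAR_ONLY_RULES` is my own assumption about what a car.abc.com-only filter might look like.

```python
import re


def parse_rules(text):
    """Parse regex-urlfilter.txt style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the decision of the first rule whose pattern is found in the URL."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # no rule matched: the URL is rejected


# Assumed rule set for the filtering step: keep car.abc.com, drop everything else.
CAR_ONLY_RULES = r"""
# keep only pages on the car.abc.com host
+^https?://car\.abc\.com/
# reject anything that did not match above
-.
"""

if __name__ == "__main__":
    rules = parse_rules(CAR_ONLY_RULES)
    for url in ("http://car.abc.com/toyota", "http://abc.com/", "http://www.abc.com/about"):
        print(url, "kept" if accepts(rules, url) else "dropped")
```

With rules like these, a seed such as http://abc.com/ is itself rejected, which would be consistent with the generator selecting 0 records when the same file is used for crawling.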
The only drawback is that I have not found a way to make the mergedb command use a file other than regex-urlfilter.txt, which is also the file used by the generator. You will have to maintain two files in the regex-urlfilter.txt format: one for the generator that accepts abc.com, and another for the mergedb command that excludes every URL not under car.abc.com. Since both commands load the same file, you will have to rename the appropriate file to regex-urlfilter.txt before calling either of the two commands.
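The renaming can be scripted so it is harder to forget. Below is a minimal sketch, assuming a local-mode Nutch 1.x install where the conf directory is read at run time and where mergedb takes the output crawl db first, then the input crawl db. The file names `regex-urlfilter.crawl.txt` and `regex-urlfilter.index.txt` are hypothetical names I made up for the two rule files, and the `runner` parameter only exists so the command can be swapped out when trying the script without Nutch.

```python
import shutil
import subprocess
from pathlib import Path

# Hypothetical file names: neither ships with Nutch.
CRAWL_RULES = "regex-urlfilter.crawl.txt"  # rules that accept all of abc.com
INDEX_RULES = "regex-urlfilter.index.txt"  # rules that accept only car.abc.com


def activate_rules(conf_dir, rules_name):
    """Copy the chosen rule file over regex-urlfilter.txt, the name both commands load."""
    conf = Path(conf_dir)
    target = conf / "regex-urlfilter.txt"
    shutil.copyfile(conf / rules_name, target)
    return target


def mergedb_filter_command(nutch, source_db, filtered_db):
    """Build the command line: <nutch> mergedb <output_crawldb> <input_crawldb> -filter"""
    return [nutch, "mergedb", filtered_db, source_db, "-filter"]


def build_filtered_crawldb(conf_dir, source_db, filtered_db,
                           nutch="bin/nutch", runner=subprocess.run):
    """Swap in the index rules, run mergedb -filter, then restore the crawl rules."""
    activate_rules(conf_dir, INDEX_RULES)
    try:
        runner(mergedb_filter_command(nutch, source_db, filtered_db), check=True)
    finally:
        # Always put the crawl rules back so the next generate sees abc.com again.
        activate_rules(conf_dir, CRAWL_RULES)
```

For example, `build_filtered_crawldb("conf", "crawl/crawldb", "crawl/crawldb_car")` would leave the original crawl db untouched and write the filtered copy to `crawl/crawldb_car` (an illustrative path), which is the one to pass to the indexing step.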
If someone knows a way to configure the mergedb command to use another file, I'd be happy to hear it!