Crawling a specified list of URLs with Nutch

Posted on 2025-01-02 12:52:32

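I have a list of one million URLs to fetch. I use this list as the Nutch seed and run Nutch's basic crawl command to fetch them. However, I found that Nutch also fetches URLs that are not in the list. I did set the crawl parameters to -depth 1 -topN 1000000, but that does not help. Does anyone know how to do this?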

走过海棠暮 2025-01-09 12:52:32

Set this property in nutch-site.xml. By default it is true, so updatedb adds outlinks to the CrawlDb.

<property>
  <name>db.update.additions.allowed</name>
  <value>false</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>
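
To confirm that the override is actually in place, a quick check along these lines may help. This is only a sketch: the function name `additions_disabled` and the `NUTCH_HOME`-based path are assumptions made for illustration, and the grep relies on `<name>` and `<value>` sitting on consecutive lines, as in the snippet above.

```shell
#!/bin/sh
# Report whether a Nutch config file turns off db.update.additions.allowed.
# Assumption: <name> and <value> are on consecutive lines, as in the
# property snippet above. This is a text match, not a real XML parse.
additions_disabled() {
    conf_file="$1"
    grep -F -A1 '<name>db.update.additions.allowed</name>' "$conf_file" \
        | grep -q -F '<value>false</value>'
}

# Hypothetical location of the config file; adjust to your install.
CONF="${NUTCH_HOME:-.}/conf/nutch-site.xml"
if [ -f "$CONF" ] && additions_disabled "$CONF"; then
    echo "updatedb will not add newly discovered URLs"
else
    echo "updatedb may still add discovered outlinks"
fi
```

If the second message appears, check that the property sits inside the root element of nutch-site.xml and that the file is the one on the classpath Nutch is actually started with.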

烟雨凡馨 2025-01-09 12:52:32

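Delete the crawl and urls directories (if they were created earlier).
Create or update the seed file, listing one URL per line.
Restart the crawl.

Command

nutch crawl urllist -dir crawl -depth 3 -topN 1000000

urllist - the directory that contains the seed file (the URL list)
crawl - the directory where the crawl data is written

If the problem still persists, try deleting your nutch folder and restarting the whole process.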
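
The cleanup and seed-file steps above could be scripted roughly as follows. This is a sketch under assumptions: the function name `prepare_seeds` and the file name `seed.txt` are made up for illustration, while `urllist` and `crawl` are the directory names used in the command above.

```shell
#!/bin/sh
# Rebuild the seed directory from a plain URL list, one URL per line.
# Assumptions: "urllist" and "crawl" are the directory names from the
# command above; "seed.txt" and the function name are hypothetical.
# Keep the input list outside both directories, since they are deleted.
prepare_seeds() {
    url_list="$1"              # input file with the URLs to fetch
    seed_dir="${2:-urllist}"   # directory handed to the crawl command
    crawl_dir="${3:-crawl}"    # output directory of a previous crawl

    # Remove state left over from earlier runs.
    rm -rf "$seed_dir" "$crawl_dir"
    mkdir -p "$seed_dir"

    # Strip carriage returns, drop blank lines, and remove duplicates.
    tr -d '\r' < "$url_list" \
        | grep -v '^[[:space:]]*$' \
        | sort -u > "$seed_dir/seed.txt"
}

# Example call (all-urls.txt is a placeholder name), followed by the
# crawl command quoted in the answer:
#   prepare_seeds all-urls.txt
#   nutch crawl urllist -dir crawl -depth 3 -topN 1000000
```

Deduplicating up front keeps the seed file an exact statement of what should be fetched, which makes it easier to compare against the CrawlDb afterwards.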
