How to make Heritrix continue crawling domains it finds that are not in the seed list

Posted 2024-12-08 08:29:51, 94 words, 0 views, 0 comments

How can I make Heritrix continue crawling domains that it discovers but that are not in the seed list?
I mean that it should not stop after it has crawled all the domains in the seed list, but should keep crawling every link it finds along the way.

Comments (3)

辞慾 2024-12-15 08:29:51

It's been a while since I last worked with Heritrix, but if memory serves, you'll need to change max-link-hops in your settings/profile. The larger you make max-link-hops, the more steps ("hops") Heritrix will take away from the seed(s) you have defined.
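
To illustrate what a hop limit means, here is a minimal sketch in Java. It is a toy model, not Heritrix code: the class HopLimitSketch, the method crawl, and the in-memory link graph are all hypothetical names made up for this example, and it ignores domain scoping entirely. It only shows how a larger hop budget lets a breadth-first crawl reach URLs further from the seed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HopLimitSketch {
    // Returns every URL reachable from the seeds in at most maxHops link hops,
    // in discovery (breadth-first) order. `links` is a toy in-memory link graph,
    // standing in for the links a real crawler would extract from fetched pages.
    static List<String> crawl(List<String> seeds, Map<String, List<String>> links, int maxHops) {
        Map<String, Integer> hopsByUrl = new LinkedHashMap<>(); // URL -> hops from nearest seed
        Deque<String> frontier = new ArrayDeque<>();
        for (String seed : seeds) {
            if (hopsByUrl.putIfAbsent(seed, 0) == null) {
                frontier.add(seed);
            }
        }
        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            int hops = hopsByUrl.get(url);
            if (hops >= maxHops) {
                continue; // hop budget spent: do not follow links from this URL
            }
            for (String next : links.getOrDefault(url, List.of())) {
                if (hopsByUrl.putIfAbsent(next, hops + 1) == null) {
                    frontier.add(next);
                }
            }
        }
        return new ArrayList<>(hopsByUrl.keySet());
    }

    public static void main(String[] args) {
        // Hypothetical link graph: seed -> page -> other domain -> third domain.
        Map<String, List<String>> links = Map.of(
            "a.example/", List.of("a.example/page"),
            "a.example/page", List.of("b.example/"),
            "b.example/", List.of("c.example/"));
        System.out.println(crawl(List.of("a.example/"), links, 1));
        System.out.println(crawl(List.of("a.example/"), links, 3));
    }
}
```

In this toy graph a budget of 1 hop stays on the seed's own pages, while a budget of 3 reaches the two other domains. In a real crawl the scope rules also have to accept those URLs, so raising the hop limit alone may not be enough.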

苍风燃霜 2024-12-15 08:29:51

By default, Heritrix is configured to crawl only URLs on the domains that are in your seed list. Some additional material hosted elsewhere is usually fetched as well, because it is embedded in the crawled pages.

If you would like Heritrix to crawl anything it comes across, you'll need to modify the scope.

The scope is composed of a series of decide rules. Each rule can ACCEPT, REJECT or pass on a URL. The last rule to either ACCEPT or REJECT wins. Typically, the first rule in the list is a blanket reject-all, followed by a SurtPrefixedDecideRule that rules in all URLs that match the SURT list. The SURT list is typically built from the seed list.

You can, however, configure the SURT list manually by specifying your own, or (if you really want everything) you can simply remove it and the reject-all rule and add an accept-all decide rule at the top.

More on configuring Heritrix 3 scoping.
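
The "last rule to ACCEPT or REJECT wins" behaviour described in this answer can be sketched in a few lines of Java. This is a toy model under stated assumptions, not the Heritrix API: ScopeSketch, Rule, decide, and acceptPrefixes are hypothetical names, and the prefix rule compares plain URL strings rather than real SURT-form prefixes.

```java
import java.util.List;

class ScopeSketch {
    enum Decision { ACCEPT, REJECT, NONE }

    // A rule maps a URL to a decision; NONE means "no opinion, pass it on".
    interface Rule {
        Decision decide(String url);
    }

    // Every rule is consulted in order; the last rule that returns
    // ACCEPT or REJECT determines the outcome.
    static Decision decide(List<Rule> rules, String url) {
        Decision result = Decision.NONE;
        for (Rule rule : rules) {
            Decision d = rule.decide(url);
            if (d != Decision.NONE) {
                result = d;
            }
        }
        return result;
    }

    static final Rule REJECT_ALL = url -> Decision.REJECT;
    static final Rule ACCEPT_ALL = url -> Decision.ACCEPT;

    // Simplified stand-in for a SURT prefix rule (assumption: plain string
    // prefixes instead of SURT form). Accepts matching URLs, passes on the rest.
    static Rule acceptPrefixes(List<String> prefixes) {
        return url -> prefixes.stream().anyMatch(url::startsWith)
                ? Decision.ACCEPT
                : Decision.NONE;
    }

    public static void main(String[] args) {
        // Seed-restricted scope: reject everything, then rule seed prefixes back in.
        List<Rule> seedOnly = List.of(REJECT_ALL, acceptPrefixes(List.of("http://seed.example/")));
        // Open scope: reject-all and the prefix rule removed, accept-all at the top.
        List<Rule> everything = List.of(ACCEPT_ALL);

        System.out.println(decide(seedOnly, "http://seed.example/page"));
        System.out.println(decide(seedOnly, "http://other.example/"));
        System.out.println(decide(everything, "http://other.example/"));
    }
}
```

With the seed-restricted list, a URL on another domain keeps the REJECT from the first rule because the prefix rule passes on it. With the open list, the same URL is accepted.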

折戟 2024-12-15 08:29:51

You can also set the SURT decide rule 'NotonDomains' to true. This will crawl all domains that are not in the seed list.
