Getting links from Nutch

Posted on 2024-12-04 10:04:47


I am using Nutch 1.3 to crawl a website. I want to get the list of URLs that were crawled, as well as the URLs linked from each page.

I get the list of crawled URLs with the readdb command:

bin/nutch readdb crawl/crawldb -dump file

Is there a way to find the URLs that appear on a page by reading the crawldb or the linkdb?

In org.apache.nutch.parse.html.HtmlParser I see an outlinks array, and I am wondering whether there is a quick way to access it from the command line.


心凉怎暖 2024-12-11 10:04:47


From the command line, you can see the outlinks by using readseg with the -dump or -get option. For example, the following dumps only the parse data of one segment, which is where the outlinks are stored:

bin/nutch readseg -dump crawl/segments/20110919084424/ outputdir2 -nocontent -nofetch -nogenerate -noparse -noparsetext

less outputdir2/dump
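To turn that dump into a per-page list of outlinks, a small script can scan it line by line. The following is a minimal sketch, not part of Nutch: it assumes each record in the dump starts with a `URL::` line and lists its links as `outlink: toUrl: ... anchor: ...` lines, and the `SAMPLE_DUMP` text and the function name `outlinks_by_page` are made up for illustration. Check the layout of your own `outputdir2/dump` before relying on it.

```python
import re
from collections import defaultdict

# Hypothetical sample imitating a `readseg -dump` text file.
# The exact layout is an assumption; compare it with your own dump.
SAMPLE_DUMP = """\
Recno:: 0
URL:: http://example.com/

ParseData::
Version: 5
Status: success(1,0)
Title: Example
Outlinks: 2
  outlink: toUrl: http://example.com/a anchor: Page A
  outlink: toUrl: http://example.org/ anchor: Example Org

Recno:: 1
URL:: http://example.com/a

ParseData::
Version: 5
Status: success(1,0)
Title: A
Outlinks: 1
  outlink: toUrl: http://example.com/ anchor: Home
"""

URL_LINE = re.compile(r"^URL::\s*(\S+)")
OUTLINK_LINE = re.compile(r"^\s*outlink: toUrl: (\S+) anchor:")


def outlinks_by_page(lines):
    """Map each page URL to the list of outlink targets found under it."""
    links = defaultdict(list)
    current = None  # URL of the record currently being read
    for line in lines:
        m = URL_LINE.match(line)
        if m:
            current = m.group(1)
            continue
        m = OUTLINK_LINE.match(line)
        if m and current is not None:
            links[current].append(m.group(1))
    return dict(links)


if __name__ == "__main__":
    for page, targets in outlinks_by_page(SAMPLE_DUMP.splitlines()).items():
        print(page)
        for target in targets:
            print("  ->", target)
```

For a real dump, pass an open file instead of the sample, e.g. `outlinks_by_page(open("outputdir2/dump", encoding="utf-8"))`.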


爱冒险 2024-12-11 10:04:47


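You can also do this with the readlinkdb command. The linkdb records the inlinks of every URL, so its dump shows which pages link to which; the outlinks of a page are the entries that name that page as the source.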

bin/nutch readlinkdb <linkdb> (-dump <out_dir> | -url <url>)

linkdb: the linkdb directory to read from.

out_dir: with -dump, the whole linkdb is written as text into the output directory you specify.

url: with -url, the information for one specific URL is printed to System.out.

For example:

bin/nutch readlinkdb crawl/linkdb -dump myoutput/out1

For more information refer to
http://wiki.apache.org/nutch/bin/nutch%20readlinkdb
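Because the linkdb dump lists, for each URL, the pages that link to it, the outlinks of a page can be recovered by inverting those entries. The sketch below is a hypothetical post-processing step, not part of Nutch: it assumes each entry in the dump is a `<url>` + tab + `Inlinks:` line followed by `fromUrl: ... anchor: ...` lines, and `SAMPLE_LINKDB_DUMP` and `invert_inlinks` are illustrative names. Verify the layout against the files in your own `myoutput/out1` directory.

```python
import re
from collections import defaultdict

# Hypothetical sample imitating a `readlinkdb -dump` text file.
# The exact layout is an assumption; compare it with your own dump.
SAMPLE_LINKDB_DUMP = """\
http://example.com/a\tInlinks:
 fromUrl: http://example.com/ anchor: Page A
 fromUrl: http://example.com/b anchor: A

http://example.com/b\tInlinks:
 fromUrl: http://example.com/ anchor: Page B
"""

TARGET_LINE = re.compile(r"^(\S+)\s+Inlinks:")
FROM_LINE = re.compile(r"^\s*fromUrl: (\S+) anchor:")


def invert_inlinks(lines):
    """Turn 'target <- sources' entries into 'source -> sorted targets'."""
    outlinks = defaultdict(set)
    target = None  # URL whose inlinks are currently being read
    for line in lines:
        m = TARGET_LINE.match(line)
        if m:
            target = m.group(1)
            continue
        m = FROM_LINE.match(line)
        if m and target is not None:
            outlinks[m.group(1)].add(target)
    return {src: sorted(dsts) for src, dsts in outlinks.items()}


if __name__ == "__main__":
    for src, dsts in invert_inlinks(SAMPLE_LINKDB_DUMP.splitlines()).items():
        print(src, "->", ", ".join(dsts))
```

Only links that made it into the linkdb show up this way, so the result may be a subset of what readseg reports for the same page.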


About the author

_失温
