Nutch and saving crawl data to Amazon S3

Posted 2024-12-04 07:58:28


I am trying to evaluate if Nutch/Solr/Hadoop are the right technologies for my task.

PS: Previously I was trying to integrate Nutch (1.4) and Hadoop to see how it works.

Here is what I am trying to achieve overall:

a) Start with a seed URL (or URLs), then crawl and parse/save the data and links
-- which the Nutch crawler does anyway.

b) Then be able to query the crawled indexes from a Java client
--- (perhaps using the SolrJ client)
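SolrJ is essentially a wrapper around Solr's HTTP API. As a minimal sketch of the kind of request a Java client ends up issuing, here is how the underlying /select URL can be built with the JDK alone; the host/port, core path, and the `content` field name are placeholders for illustration, not details from this post:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: the HTTP GET that a SolrJ query boils down to.
// The base URL and the "content" field are assumed placeholders.
public class SolrQueryUrl {

    // Build a /select query URL for the given Solr base URL, query string, and row limit.
    public static String buildSelectUrl(String solrBase, String query, int rows) {
        try {
            return solrBase + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&rows=" + rows + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Search the hypothetical "content" field for documents mentioning nutch.
        System.out.println(buildSelectUrl("http://localhost:8983/solr", "content:nutch", 10));
        // → http://localhost:8983/solr/select?q=content%3Anutch&rows=10&wt=json
    }
}
```

With SolrJ on the classpath, this is roughly what constructing a `SolrQuery` and calling `query()` on a Solr server client does for you, with response parsing on top.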

c) Since Nutch (as of 1.4.x) already uses Hadoop internally, I will just install Hadoop and configure it in the nutch-*.xml files.
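For reference, a minimal sketch of what goes into nutch-site.xml (overrides for nutch-default.xml); `http.agent.name` is a real Nutch property that must be set before crawling, while the value below is a placeholder. Note that in deploy mode the Hadoop connection itself is normally picked up from the Hadoop installation's own configuration, not from the nutch-*.xml files:

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml sketch: site-specific overrides for nutch-default.xml.
     The agent name value is a placeholder. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>my-test-crawler</value>
    <description>Identifies the crawler in HTTP requests; Nutch refuses to crawl without it.</description>
  </property>
</configuration>
```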

d) I would like Nutch to save the crawled indexes to Amazon S3, and Hadoop to use S3 as its file system.
Is this even possible? Is it even worth it?

e) I read in one of the forums that in Nutch 2.0 there is a data layer using GORA that can save indexes to HBase etc. I don't know when the 2.0 release is due. :-(
Does anyone suggest grabbing the 2.0 "in-progress" trunk and starting to use it, hoping to get a released lib sooner or later?

PS: I am still trying to figure out how/when/why/where Nutch uses Hadoop internally. I just cannot find any written documentation or tutorials. Any help on this aspect is also much appreciated.

If you are reading this line, then thank you so much for reading this post up to this point :-)


Answers (1)

吝吻 2024-12-11 07:58:29


Hadoop can use S3 as its underlying file system natively. I have had very good results with this approach when running Hadoop in EC2, either using EMR or your own / third-party Hadoop AMIs. I would not recommend using S3 as the underlying file system when using Hadoop outside of EC2, as bandwidth limitations would likely negate any performance gains Hadoop would give you. The S3 adapter for Hadoop was developed by Amazon and is part of the Hadoop core. Hadoop treats S3 just like HDFS. See http://wiki.apache.org/hadoop/AmazonS3 for more info on using Hadoop with S3.
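Concretely, pointing Hadoop of that generation at S3 comes down to a few core-site.xml properties. A hedged sketch, using the legacy `s3n` adapter property names from that era; the bucket name and credential values are placeholders:

```xml
<?xml version="1.0"?>
<!-- core-site.xml sketch: use S3 (via the s3n adapter) as the default file system.
     "my-crawl-bucket" and both credential values are placeholders. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>s3n://my-crawl-bucket</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
  </property>
</configuration>
```

With this in place, Hadoop commands and jobs address paths such as `s3n://my-crawl-bucket/crawl` exactly as they would HDFS paths.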

Nutch is designed to run as a job on a Hadoop cluster (when in "deploy" mode) and therefore does not include the Hadoop jars in its distribution. Because it runs as a Hadoop job, however, it can access any underlying data store that Hadoop supports, such as HDFS or S3. When run in "local" mode, you will provide your own local Hadoop installation. Once crawling is finished in "deploy" mode, the data will be stored in the distributed file system. It is recommended that you wait for indexing to finish and then download the index to a local machine for searching, rather than searching in the DFS, for performance reasons. For more on using Nutch with Hadoop, see http://wiki.apache.org/nutch/NutchHadoopTutorial.

Regarding HBase, I have had good experiences using it, although not for your particular use case. I can imagine that for random searches, Solr may be faster and more feature-rich than HBase, but this is debatable. HBase is probably worth a try. Until 2.0 comes out, you may want to write your own Nutch-to-HBase connector or simply stick with Solr for now.
