关于使用nutch抓取短网址

发布于 2024-10-14 09:02:52 字数 2495 浏览 6 评论 0原文

我正在为我的应用程序使用 nutch 爬虫，该应用程序需要抓取一组我提供给 urls 目录的 URL，并仅获取该 URL 的内容。我对内部或外部链接的内容不感兴趣。因此，我使用了 NUTCH 爬虫，并通过将深度指定为 1 来运行爬行命令。Nutch

bin/nutch crawl urls -dir crawl -depth 1

爬行 URL 并为我提供给定 URL 的内容。

我正在使用readseg实用程序阅读内容。

bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata

这样我就可以获取网页的内容了。

我面临的问题是，如果我给出直接的网址

http://isoc.org/wp/worldipv6day/
http://openhackindia.eventbrite.com
http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
http://bangalore.yahoo.com/labs/summerschool.html
http://riadevcamp.eventbrite.com
http://www.sleepingtime.org/

，那么我就能够获取网页的内容。但是，当我将 URL 集作为短 URL 提供时，

http://is.gd/jOoAa9
http://is.gd/ubHRAF
http://is.gd/GiFqj9
http://is.gd/H5rUhg
http://is.gd/wvKINL
http://is.gd/K6jTNl
http://is.gd/mpa6fr
http://is.gd/fmobvj
http://is.gd/s7uZf***

我无法获取内容。

当我阅读这些片段时，它没有显示任何内容。请在下面找到从段读取的转储文件的内容。

*Recno:: 0
URL:: http://is.gd/0yKjO6
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1295969171407
Content::
Version: -1
url: http://is.gd/0yKjO6
base: http://is.gd/0yKjO6
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
Recno:: 1
URL:: http://is.gd/1tpKaN
Content::
Version: -1
url: http://is.gd/1tpKaN
base: http://is.gd/1tpKaN
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0*

我还尝试将nutch-default.xml 中的max.redirects 属性设置为4，但没有找到任何进展。请为我提供这个问题的解决方案。

谢谢和问候，阿琼·库马尔·雷迪

原文

I am using nutch crawler for my application which needs to crawl a set of URLs which I give to the urls directory and fetch only the contents of that URL only.
I am not interested in the contents of the internal or external links.
So I have used NUTCH crawler and have run the crawl command by giving depth as 1.

bin/nutch crawl urls -dir crawl -depth 1

Nutch crawls the urls and gives me the contents of the given urls.

I am reading the content using readseg utility.

bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata

With this I am fetching the content of webpage.

The problem I am facing is if I give direct urls like

http://isoc.org/wp/worldipv6day/
http://openhackindia.eventbrite.com
http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
http://bangalore.yahoo.com/labs/summerschool.html
http://riadevcamp.eventbrite.com
http://www.sleepingtime.org/

then I am able to get the contents of the webpage.
But when I give the set of URLs as short URLs like

http://is.gd/jOoAa9
http://is.gd/ubHRAF
http://is.gd/GiFqj9
http://is.gd/H5rUhg
http://is.gd/wvKINL
http://is.gd/K6jTNl
http://is.gd/mpa6fr
http://is.gd/fmobvj
http://is.gd/s7uZf***

I am not able to fetch the contents.

When I read the segments, it is not showing any content. Please find below the content of dump file read from segments.

*Recno:: 0
URL:: http://is.gd/0yKjO6
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1295969171407
Content::
Version: -1
url: http://is.gd/0yKjO6
base: http://is.gd/0yKjO6
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
Recno:: 1
URL:: http://is.gd/1tpKaN
Content::
Version: -1
url: http://is.gd/1tpKaN
base: http://is.gd/1tpKaN
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0*

I have also tried by setting the max.redirects property in nutch-default.xml as 4 but dint find any progress.
Kindly provide me a solution for this problem.

Thanks and regards,
Arjun Kumar Reddy

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

沉默的熊 2024-10-21 09:02:52

使用nutch 1.2尝试编辑文件conf/nutch-default.xml
找到 http.redirect.max 并将值至少更改为 1 而不是默认的 0。

<property>
  <name>http.redirect.max</name>
  <value>2</value><!-- instead of 0 -->
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

祝你好运

Using nutch 1.2 try editing the file conf/nutch-default.xml
find http.redirect.max and change the value to at least 1 instead of the default 0.

<property>
  <name>http.redirect.max</name>
  <value>2</value><!-- instead of 0 -->
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

Good luck

回复收藏 0 原文