Crawling short URLs with Nutch

Published 2024-10-14 09:02:52


I am using the Nutch crawler for my application, which needs to crawl a set of URLs that I provide in the urls directory and fetch only the content of those URLs.
I am not interested in the content of internal or external links.
So I have used the Nutch crawler and have run the crawl command with a depth of 1:

bin/nutch crawl urls -dir crawl -depth 1

Nutch crawls the URLs and gives me the content of the given URLs.

I am reading the content using the readseg utility:

bin/nutch readseg -dump crawl/segments/* arjun -nocontent -nofetch -nogenerate -noparse -noparsedata

With this I can fetch the content of a webpage.

The problem I am facing is that if I give direct URLs like

http://isoc.org/wp/worldipv6day/
http://openhackindia.eventbrite.com
http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/
http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php
http://bangalore.yahoo.com/labs/summerschool.html
http://riadevcamp.eventbrite.com
http://www.sleepingtime.org/

then I am able to get the content of the pages.
But when I give the set of URLs as short URLs, like

http://is.gd/jOoAa9
http://is.gd/ubHRAF
http://is.gd/GiFqj9
http://is.gd/H5rUhg
http://is.gd/wvKINL
http://is.gd/K6jTNl
http://is.gd/mpa6fr
http://is.gd/fmobvj
http://is.gd/s7uZf***

I am not able to fetch the content.

When I read the segments, no content is shown. Below is the dump file read from the segments:

Recno:: 0
URL:: http://is.gd/0yKjO6
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: _ngt_: 1295969171407
Content::
Version: -1
url: http://is.gd/0yKjO6
base: http://is.gd/0yKjO6
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
Recno:: 1
URL:: http://is.gd/1tpKaN
Content::
Version: -1
url: http://is.gd/1tpKaN
base: http://is.gd/1tpKaN
contentType: text/html
metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Tue Jan 25 20:56:07 IST 2011
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
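What the dump does show is that the fetcher recorded the redirect rather than following it: each Content:: block's metadata line carries the server's Location header, and the CrawlDatum status is still db_unfetched. As a sketch (the dump format is just the plain text shown above), the redirect targets can be pulled out with a small script:

```python
import re

# Each Content:: block in a readseg dump carries the HTTP response headers
# in its "metadata:" line; for a 301/302 the redirect target appears as
# "Location=<url>". URLs in the dump contain no spaces, so \S+ suffices.
def redirect_targets(dump_text):
    return re.findall(r"\bLocation=(\S+)", dump_text)

dump = (
    "metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 "
    "Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 "
    "_fst_=36 nutch.segment.name=20110125205614"
)
print(redirect_targets(dump))
# ['http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1']
```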

I have also tried setting the max.redirects property in nutch-default.xml to 4, but that made no difference.
Kindly provide me a solution for this problem.

Thanks and regards,
Arjun Kumar Reddy


沉默的熊 2024-10-21 09:02:52


Using Nutch 1.2, try editing the file conf/nutch-default.xml:
find http.redirect.max and change the value to at least 1 instead of the default 0.

<property>
  <name>http.redirect.max</name>
  <value>2</value><!-- instead of 0 -->
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

Good luck
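One note on where to put the change: edits to conf/nutch-default.xml can be lost on upgrade, and the usual Nutch convention is to put overrides in conf/nutch-site.xml instead, which takes precedence. A minimal override would look like:

```xml
<!-- conf/nutch-site.xml: site-specific overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <value>2</value>
  </property>
</configuration>
```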

只是偏爱你 2024-10-21 09:02:52


You will have to set a depth of 2 or more, because the first fetch returns a 301 (or 302) code. The redirect is followed on the next iteration, so you have to allow more depth.

Also, make sure that all URLs that will be followed are allowed by your regex-urlfilter.txt.
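For reference, the stock conf/regex-urlfilter.txt ends with a catch-all accept rule; a minimal permissive version (a sketch, keep whichever skip rules your crawl needs) looks like:

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept anything else
+.
```

With this in place, the redirect targets (e.g. the holykaw.alltop.com pages behind the is.gd links) will not be filtered out on the second iteration.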
