Crawling short URLs with Nutch
I am using the Nutch crawler for my application, which needs to crawl a set of URLs that I give to the urls directory and fetch only the contents of those URLs.
I am not interested in the contents of internal or external links.
So I have used the Nutch crawler and run the crawl command with the depth set to 1:
bin/nutch crawl urls -dir crawl -depth 1
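For reference, the urls directory just holds plain-text seed files, one URL per line (the file name seed.txt below is my own choice; Nutch injects every file it finds in the directory):

# urls/seed.txt -- one URL per line, no other markup
http://isoc.org/wp/worldipv6day/
http://is.gd/jOoAa9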
Nutch crawls the URLs and gives me their contents.
I am reading the content using the readseg utility:
bin/nutch readseg -dump crawl/segments/* arjun -nofetch -nogenerate -noparse -noparsedata -noparsetext
With this I am fetching the content of the web pages.
The problem I am facing is that if I give direct URLs like
http://isoc.org/wp/worldipv6day/ http://openhackindia.eventbrite.com http://www.urlesque.com/2010/06/11/last-shot-ye-olde-twitter/ http://www.readwriteweb.com/archives/place_your_tweets_with_twitter_locations.php http://bangalore.yahoo.com/labs/summerschool.html http://riadevcamp.eventbrite.com http://www.sleepingtime.org/
then I am able to get the contents of the web pages.
But when I give the set of URLs as short URLs like
http://is.gd/jOoAa9 http://is.gd/ubHRAF http://is.gd/GiFqj9 http://is.gd/H5rUhg http://is.gd/wvKINL http://is.gd/K6jTNl http://is.gd/mpa6fr http://is.gd/fmobvj http://is.gd/s7uZf
I am not able to fetch the contents.
When I read the segments, they show no content. Please find below the content of the dump file read from the segments:
Recno:: 0
URL:: http://is.gd/0yKjO6
CrawlDatum:: Version: 7 Status: 1 (db_unfetched) Fetch time: Tue Jan 25 20:56:07 IST 2011 Modified time: Thu Jan 01 05:30:00 IST 1970 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 1.0 Signature: null Metadata: _ngt_: 1295969171407
Content:: Version: -1 url: http://is.gd/0yKjO6 base: http://is.gd/0yKjO6 contentType: text/html metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/the-twitter-cool-of-a-to-z?tu4=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:

Recno:: 1
URL:: http://is.gd/1tpKaN
Content:: Version: -1 url: http://is.gd/1tpKaN base: http://is.gd/1tpKaN contentType: text/html metadata: Date=Tue, 25 Jan 2011 15:26:28 GMT nutch.crawl.score=1.0 Location=http://holykaw.alltop.com/fighting-for-women-who-dont-want-a-voice?tu3=1 _fst_=36 nutch.segment.name=20110125205614 Content-Type=text/html; charset=UTF-8 Connection=close Server=nginx X-Powered-By=PHP/5.2.14
Content:
CrawlDatum:: Version: 7 Status: 1 (db_unfetched) Fetch time: Tue Jan 25 20:56:07 IST 2011 Modified time: Thu Jan 01 05:30:00 IST 1970 Retries since fetch: 0 Retry interval: 2592000 seconds (30 days) Score: 1.0
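The empty Content: fields together with the Location= headers above suggest that is.gd answers with a plain HTTP redirect instead of the page itself; a quick header check from the shell (outside Nutch) shows the same thing:

# fetch headers only; is.gd should answer with a 301/302 and a
# Location header pointing at the expanded URL
curl -I http://is.gd/0yKjO6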
I have also tried setting the max.redirects property in nutch-default.xml to 4, but didn't see any progress.
Kindly provide me with a solution to this problem.
Thanks and regards,
Arjun Kumar Reddy
Comments (2)
Using Nutch 1.2, try editing the file conf/nutch-default.xml: find http.redirect.max and change its value to at least 1 instead of the default 0.
Good luck
You will have to set a depth of 2 or more, because the first fetch returns a 301 (or 302) code. The redirect is only followed on the next iteration, so you have to allow more depth.
Also, make sure that you allow all the URLs that will be followed in your regex-urlfilter.txt; see the sketch below.
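A minimal conf/regex-urlfilter.txt for this case might look like the following sketch; note that the stock filter contains a rule skipping URLs with query characters, which would reject redirect targets such as the ...?tu4=1 links seen in the dump:

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):

# the stock rule "-[?*!@=]" is dropped here so that redirect targets
# carrying query strings (e.g. ...?tu4=1) are not filtered out

# accept anything else
+.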